Linux PC Benchmarks
General
Both 32-Bit and 64-Bit versions of Ubuntu Linux were installed on an eSATA/USB hard disk and on USB Flash drives, to compile and assemble existing PC benchmarks via the compiler and assembler that are included in the package. The booting method used also enabled loading Ubuntu on a range of different PCs and laptops.
The benchmark programs, including source code and compile/link commands, are compressed in .tar.gz format. Copy the .tar.gz file to your home directory or a subdirectory for extraction, then examine the README file for further directions. The benchmarks are simple execution files and do not need installing. The first ones run in a Terminal window via the normal ./name command or via clicking on a shell script containing the commands. Details are displayed while the tests are running and performance results are saved in a .txt file.
The benchmarks were later recompiled under Ubuntu 14.04 with GCC 4.8.2, which can handle later Intel CPU instructions, including AVX1. Results are included below.
When the recompiled benchmarks produced significantly different results to the older ones, they are available in AVX_benchmarks.tar.gz. This also contains source codes with changes that enable error free compiling and correct execution. Further details are in Linux AVX benchmarks.htm.
Latest results are for a quad core/8 thread 3.7 GHz Core i7 4820K with 10 MB L3 cache, normally running at the Turbo Boost speed of 3.9 GHz. It has 32 GB DDR3 RAM on 4 memory channels, with a maximum speed of 800 MHz (bus speed) x 2 (DDR) x 4 (channels) x 8 (bus width in bytes) or 51.2 GB/second.
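The memory speed arithmetic above can be written as a small C function. This is only a sketch of the calculation quoted in the text; the function name and parameter names are illustrative and are not taken from the benchmark source code.

```c
/* Peak theoretical memory bandwidth in GB/second.
   bus_mhz             - memory bus clock in MHz (800 for the Core i7 above)
   transfers_per_clock - 2 for DDR memory
   channels            - number of memory channels
   bytes_wide          - bus width per channel in bytes (8 = 64 bits) */
double peak_bandwidth_gbs(double bus_mhz, int transfers_per_clock,
                          int channels, int bytes_wide)
{
    double bytes_per_second = bus_mhz * 1.0e6 * transfers_per_clock
                              * channels * bytes_wide;
    return bytes_per_second / 1.0e9;
}
```

Calling peak_bandwidth_gbs(800.0, 2, 4, 8) reproduces the 51.2 GB/second figure quoted for the Core i7.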
Configuration Details
All benchmarks include the same configuration details, some of which are produced via assembly language code. Example details shown are for an AMD Phenom quad core processor via 32-Bit Ubuntu and an Intel Core 2 Duo using the 64-Bit version.
######################################################################
Assembler CPUID and RDTSC
CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000306E4
Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz
Measured - Minimum 3711 MHz, Maximum 3711 MHz
Linux Functions
get_nprocs() - CPUs 8, Configured CPUs 8
get_phys_pages() and size - RAM Size 31.51 GB, Page Size 4096 Bytes
uname() - Linux, roy-WD32, 2.6.35-24-generic-pae
#42-Ubuntu SMP Thu Dec 2 03:21:31 UTC 2010, i686
Assembler CPUID and RDTSC
CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42
AMD Phenom(tm) II X4 945 Processor
Measured - Minimum 2978 MHz, Maximum 3008 MHz
Linux Functions
get_nprocs() - CPUs 4, Configured CPUs 4
get_phys_pages() and size - RAM Size 7.88 GB, Page Size 4096 Bytes
uname() - Linux, roy-C2D, 2.6.35-22-generic-pae
#35-Ubuntu SMP Sat Oct 16 22:16:51 UTC 2010, i686
Assembler CPUID and RDTSC
CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006F6
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Measured - Minimum 2407 MHz, Maximum 2407 MHz
Linux Functions
get_nprocs() - CPUs 2, Configured CPUs 2
get_phys_pages() and size - RAM Size 3.87 GB, Page Size 4096 Bytes
uname() - Linux, roy-64Bit, 2.6.35-22-generic
#33-Ubuntu SMP Sun Sep 19 20:32:27 UTC 2010, x86_64
Identified with Fedora Linux
uname() - Linux, localhost.localdomain, 2.6.34.7-61.fc13.x86_64
#1 SMP Tue Oct 19 04:06:30 UTC 2010, x86_64
######################################################################
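The Linux Functions lines in the details above can be approximated with the glibc calls named in the log. The following is a minimal sketch, not the benchmark's own code: the function name print_linux_config is hypothetical, the RAM size is assumed to be pages x page size divided by 1024 x 1024 x 1024, and the assembler CPUID and RDTSC part is omitted.

```c
#include <stdio.h>
#include <sys/sysinfo.h>
#include <sys/utsname.h>
#include <unistd.h>

/* Print configuration details similar to the "Linux Functions" lines.
   Returns 0 on success or -1 if uname() fails. */
int print_linux_config(FILE *out)
{
    struct utsname u;
    long pages = get_phys_pages();            /* physical RAM pages */
    long page_size = sysconf(_SC_PAGESIZE);   /* bytes per page */
    /* Assumption: GB here means 1024 x 1024 x 1024 bytes */
    double ram_gb = (double)pages * (double)page_size
                    / (1024.0 * 1024.0 * 1024.0);

    if (uname(&u) != 0)
        return -1;

    fprintf(out, "get_nprocs() - CPUs %d, Configured CPUs %d\n",
            get_nprocs(), get_nprocs_conf());
    fprintf(out, "get_phys_pages() and size - RAM Size %.2f GB, "
                 "Page Size %ld Bytes\n", ram_gb, page_size);
    fprintf(out, "uname() - %s, %s, %s\n", u.sysname, u.nodename, u.release);
    fprintf(out, "%s, %s\n", u.version, u.machine);
    return 0;
}
```

Calling print_linux_config(stdout) on a Linux system with glibc should print four lines in the same layout as the examples.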
32-Bit and 64-Bit Differences
The main advantage of 64-Bit working is that the amount of main memory that can be installed and accessed is much larger than with 32-Bit operation. The downside can be worse performance if integer array variables are defined as 64 bits, leading to twice the data volumes being read and written.
The original x87 floating point instructions are not normally generated by 64-Bit compilations. Instead, SSE instructions are used for 32-Bit Single Precision (SP) floating point numbers and SSE2 for 64-Bit Double Precision (DP). These are potentially Single Instruction Multiple Data (SIMD) instructions, where four SP results or two DP results can be produced per clock cycle or, with linked adds and multiplies, eight or four results. Unfortunately, it seems that only Single Instruction Single Data (SISD) operations are generated, where only one number is used in the 128 bit registers, and this can lead to slower performance than a program compiled for 32-Bits with x87 instructions.
The main performance gain at 64-Bits appears to come from the provision of twice as many general purpose and SSE registers which, with optimisation options, provide faster speeds by reducing the need to save and reload variables via slower memory.
Some of these effects, for better and for worse, are reflected in the tables below.
Classic Benchmarks
The Classic Benchmarks are the first programs used to measure relative performance of computers. They are:
Livermore Kernels (Livermore Loops) - Produced for the first supercomputers and comprising 14 kernels in 1970, then 24 in the 1980s. The 24 kernels are run at three different data sizes. Results are in Millions of Floating Point Operations Per Second (MFLOPS) with one measurement for each kernel and some overall figures, where Geometric Mean is the official overall rating.
Whetstone Benchmark - the first general purpose benchmark that set industry standards of performance, particularly for minicomputers, and introduced in 1972. The benchmark produced speed ratings in terms of Thousands of Whetstone Instructions Per Second (KWIPS). In 1978, self timing versions (by yours truly) produced speed ratings, for each of the eight test procedures, in MOPS (Millions of Operations Per Second) or MFLOPS, with an overall rating in MWIPS.
Dhrystone Benchmarks 1.1 and 2.1 - The Dhrystone benchmark, a sort of Whetstone without floating point, became the key standard benchmark, from 1984, with the growth of Unix systems. The second version (2.1) was produced to avoid over-optimisation problems encountered with version 1.1. Original performance ratings were in terms of Dhrystones per second. This was later changed to VAX MIPS by dividing Dhrystones per second by 1757, the DEC VAX 11/780 result.
Linpack Benchmark - This benchmark was produced from the "LINPACK" package of linear algebra routines. It became the primary benchmark for scientific applications from the mid 1980s, with a slant towards supercomputer performance and with speed measured in MFLOPS.
Further details and references can be found in classic.htm.
On starting execution, the programs go through a calibration phase to determine the number of passes to run for more than 2 seconds with Dhrystone, 1 second for each of 8 tests with Linpack, 1 second for each of 72 tests with Livermore Loops and 10 seconds overall with Whetstone. Displayed results demonstrate that running time is proportional to the number of passes.
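A calibration phase of this kind can be sketched in C as below. This is an illustration under assumptions, not the method used in the benchmark sources: the pass count is doubled until the timed run reaches the target time, one_pass is a hypothetical stand-in for the real test code, and the names are invented for the example.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

static volatile double sink;   /* keeps the work from being discarded */

/* Hypothetical stand-in for one pass of a benchmark test */
static void one_pass(void)
{
    double s = sink;
    for (int i = 0; i < 100000; i++)
        s += (double)i * 0.5;
    sink = s;
}

static double seconds_now(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return (double)t.tv_sec + (double)t.tv_nsec * 1.0e-9;
}

/* Double the number of passes until a timed run takes at least
   min_seconds. The time for the final run is stored in *elapsed. */
long calibrate_passes(double min_seconds, double *elapsed)
{
    long passes = 1;
    for (;;) {
        double start = seconds_now();
        for (long p = 0; p < passes; p++)
            one_pass();
        *elapsed = seconds_now() - start;
        /* Safety cap so the loop always terminates */
        if (*elapsed >= min_seconds || passes >= 1000000000L)
            return passes;
        passes *= 2;
    }
}
```

With this approach, the displayed time at each step should roughly double along with the pass count, which is the proportionality mentioned above.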
For the benchmark execution codes and source files, download classic_benchmarks.tar.gz.
Four execution files are provided for each benchmark. They comprise 32-Bit and 64-Bit compilations, non-optimised and optimised varieties.
On downloading to Windows, the file appeared as classic_benchmarks.tar.tar but seemed to be fine with the name changed to classic_benchmarks.tar.gz.
Classic Benchmark Results
Results of these Linux based benchmarks are included with those run via Windows in the following reports. Some examples are given below, all for using 1 CPU of a 2.4 GHz Core 2 Duo and 2014 speeds of a 3.7 GHz Core i7, running at the Turbo Boost speed of 3.9 GHz.
The benchmarks were recompiled under Ubuntu 14.04 with GCC 4.8.2, which can handle later Intel CPU instructions, including AVX1, and results are shown below (New x64 with SSE/SSE2 and New AVX). The Core i7 maximum speed in GFLOPS per core (4 available) is GHz x 4 (SSE single precision values per register) x 2 (with linked multiply and add) or 31.2 GFLOPS, and 62.4 using AVX1. Using double precision, the best possible scores are 15.6 and 31.2 GFLOPS respectively.
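The maximum speeds quoted above follow from a simple formula, sketched here in C. The function and parameter names are illustrative only.

```c
/* Peak GFLOPS per core =
   GHz x values held in one SIMD register x operations per value.
   register_bits - 128 for SSE/SSE2, 256 for AVX1
   value_bits    - 32 for single precision, 64 for double precision
   ops_per_value - 2 with linked multiply and add, 1 for adds alone */
double peak_gflops(double ghz, int register_bits, int value_bits,
                   int ops_per_value)
{
    int values_per_register = register_bits / value_bits;
    return ghz * (double)values_per_register * (double)ops_per_value;
}
```

At 3.9 GHz, peak_gflops(3.9, 128, 32, 2) gives the 31.2 GFLOPS SSE figure and peak_gflops(3.9, 256, 32, 2) gives 62.4 for AVX1, with 15.6 and 31.2 when value_bits is 64.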
The only real beneficiary of the recompilation is the Linpack benchmark via AVX options. Some of the Livermore Loops should also benefit, given the really simple structure used, but this is presently beyond the capabilities of the compiler.
Whetstone Benchmark Optimised
                   MWIPS  MFLOP  MFLOP  MFLOP    COS    EXP  FIXPT     IF  EQUAL
                              1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS
2.4 GHz Core 2
 32 Bit             2280    815    811    576   56.5   22.6   4011   7413   3651
 64 Bit             2560    865    885    589   65.7   29.1   3851   5314   1078
3.7 GHz (TB 3.9) Core i7
 32 Bit             3959   1331   1331    938     97   42.1   6516  10967   5851
 64 Bit             4880   1331   1324    977    129   64.2   6517  11657   1812
 New x64            4891   1330   1323    977    120   64.5   6505  11638   3903
 New AVX            4897   1325   1323    977    120   64.5   6515  11649   3909
Livermore Loops MFLOPS 24 Kernels Optimised
Loop           1     2     3     4     5     6     7     8     9    10    11    12
              13    14    15    16    17    18    19    20    21    22    23    24
2.4 GHz Core 2
 32 Bit     1953  1223  1584  1534   343  1238  2192  2385  2147  1187   795   479
             161   396   276   956  1368   959   509   385  1385   165  1182   560
 64 Bit     1702  1340  1593  1531   341  1199  2422  3060  2057   770   798   861
             481   673   444   992  1029  1222   461   423  1251   351  1184   819
3.7 GHz (TB 3.9) Core i7
 32 Bit     4327  3661  2622  2642   527  2250  4217  5549  5223  2511  1311  1279
             450  1036   730  2038  2479  2835   810   783  2820   419  2022   967
 64 Bit     4707  3434  2629  2657   565  2155  4592  6131  5442  2602  1314  1296
             937  1239  2288  2293  2392  3538   839   968  2792   939  2034  1720
 New x64    4729  3422  2639  2657   565  2164  4599  5714  4984  2446  1310  1879
            1018  1267  2287  2012  2397  5343   836   969  3042   940  2011  1840
 New AVX    4692  3488  2638  2654   564  2160  4471  5717  4978  2619  1308  1863
             978  1305  2285  2043  2492  6418   836   968  3069   938  2010  1558
             ---------- Dhrystone ----------  --- Linpack ---
               Dhry1   Dhry1   Dhry2   Dhry2
               NoOpt     Opt   NoOpt     Opt
                 VAX     VAX     VAX     VAX   No Opt     Opt
                MIPS    MIPS    MIPS    MIPS   MFLOPS  MFLOPS
2.4 GHz Core 2
 32 Bit         3428   13599    3348    5852      404    1288
 64 Bit         3643   18738    3288   12265      378    1577
3.7 GHz (TB 3.9) Core i7
 32 Bit         7108   29277    7478   16356      988    2534
 64 Bit         8436   32659    8481   23607      900    3672
 New x64        8441   32499    8381   24140      946    3631
 New AVX        8441   32575    8395   23626      935    5413
Maximum CPU Speeds
Benchmarks whatcpu32 and whatcpu64 are essentially the same as cpuid and cpuid64, produced for Windows, with description and results in WhatCPU results.htm.
The programs were written with a view towards demonstrating maximum CPU performance executing all types of arithmetic instructions. The execution files and source code are available for download in max_cpu_speeds.tar.gz.
The benchmark programs use assembler level instructions, including full SIMD operations where appropriate, to simply add values via 1, 2, 3 and 4 registers. Results are in MIPS and MFLOPS, millions of adds per second in both cases. The programs also check that the end totals are correct. The 32 bit version adds 32 bit integers, then 32 bit single precision and 64 bit double precision floating point numbers using the original x87 instructions. This is followed by adding 32 bit integers using MMX and SSE2 instructions and 64 bit integers also using SSE2 functions. Finally there are 32 bit floating point additions using SSE instructions plus 3DNow, using AMD processors, and 64 bit floating point sums with SSE2 operations.
MMX, x87 and 3DNow instructions are not available at 64 bit working, but normal integer instructions are provided to use 64 bit numbers which, in the case of this register based program, mainly run at the same speed as with 32 bit arithmetic.
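The idea of adding via 1, 2, 3 and 4 registers can be illustrated in C using independent accumulators, as in the sketch below. This is not the benchmark code, which uses assembler level instructions; the function name is hypothetical, and an optimising compiler is free to vectorise or simplify these loops, which is presumably one reason for using assembler in the real programs.

```c
#include <stdint.h>

/* Add `value` into 1 to 4 independent accumulators for `loops`
   iterations, then return the combined total. With more accumulators
   there are more independent adds available per loop iteration.
   Returns 0 if `registers` is outside the range 1 to 4. */
uint32_t add_with_registers(uint32_t value, uint32_t loops, int registers)
{
    uint32_t r0 = 0, r1 = 0, r2 = 0, r3 = 0;
    uint32_t i;

    switch (registers) {
    case 1:
        for (i = 0; i < loops; i++) { r0 += value; }
        break;
    case 2:
        for (i = 0; i < loops; i++) { r0 += value; r1 += value; }
        break;
    case 3:
        for (i = 0; i < loops; i++) { r0 += value; r1 += value; r2 += value; }
        break;
    case 4:
        for (i = 0; i < loops; i++) {
            r0 += value; r1 += value; r2 += value; r3 += value;
        }
        break;
    default:
        return 0;
    }
    return r0 + r1 + r2 + r3;
}
```

The returned total should equal value x loops x registers (modulo 2 to the power 32), which mirrors the end total check mentioned above.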
Results below are for an AMD Phenom X4 and Intel Core 2 Duo, using one CPU in each case. These suggest three integer adds and two 64 bit MMX operations can be executed per clock cycle. Then SSE/SSE2 floating point calculation speed is based on one 128 bit register dealt with per cycle. Best is eight 32 bit SSE integer adds per cycle.
Here, the AMD processor appears to be more efficient than the Intel CPU, but later Intel i7 32 bit and 64 bit results correct some of this anomaly.
Results from a later Core i7 are also shown. This CPU has AVX1 instructions included, with 256 bit registers, producing up to eight 32 bit floating point results per CPU cycle (31.2 GFLOPS at 3.9 GHz) on addition, and twice this with linked multiply and add instructions. The latter were included in a new AVX test (AVXid64), demonstrating 62 GFLOPS at 3.9 GHz. Details of the latter can be found in Linux AVX benchmarks.htm.
Word 32 bit OS Version 64 bit OS Version
Size 1 Reg 2 Reg 3 Reg 4 Reg 1 Reg 2 Reg 3 Reg 4 Reg
Core i7 3.7 GHz
at up to 3.9 GHz
via Turbo Boost
32 bit Integer MIPS 4301 8551 11994 12292 4302 8559 11996 12293
64 bit Integer MIPS - - - - 4302 8553 11996 12293
32 bit x87 MFLOPS 1303 2607 3865 3864 - - - -
64 bit x87 MFLOPS 1303 2607 3865 3864 - - - -
32 bit MMX Int MIPS 7822 14900 14932 14900 - - - -
32 bit SSE2 Int MIPS 15642 29800 29868 29800 15643 29802 29870 29805
64 bit SSE2 Int MIPS 7821 14899 14934 14900 7822 14901 14935 14901
32 bit SSE MFLOPS 5214 10427 15459 15457 5214 10429 15460 15459
64 bit SSE2 MFLOPS 2607 5214 7730 7729 2607 5215 7730 7729
32 bit 3DNow MFLOPS - - - - - - - -
32 bit AVX1 MFLOPS - - - - 10430 20860 - 30920
64 bit AVX1 MFLOPS - - - - 5210 10430 - 15460
32 bit AVX1 +* MFLOPS - - - - - - - 62000
64 bit AVX1 +* MFLOPS - - - - - - - 31000
Phenom II X4
3.0 GHz
32 bit Integer MIPS 3314 6629 8664 9040 3315 6629 8664 9040
64 bit Integer MIPS - - - - 3315 6629 7701 8287
32 bit x87 MFLOPS 753 1506 2259 3013 - - - -
64 bit x87 MFLOPS 753 1506 2259 3013 - - - -
32 bit MMX Int MIPS 3012 6026 9036 12054 - - - -
32 bit SSE2 Int MIPS 6024 12050 18073 24107 6025 12053 18081 24107
64 bit SSE2 Int MIPS 3012 6025 9037 12053 3013 6027 9040 12053
32 bit SSE MFLOPS 3012 6024 9037 12050 3013 6025 9040 12053
64 bit SSE2 MFLOPS 1506 3012 4518 6025 1506 3013 4519 6027
32 bit 3DNow MFLOPS 1506 3012 4518 6025 - - - -
Core 2 Duo
2.4 GHz
32 bit Integer MIPS 2629 4915 5356 6605 2601 4410 5226 6606
64 bit Integer MIPS - - - - 2612 3908 5525 5285
32 bit x87 MFLOPS 801 1601 2402 2402 - - - -
64 bit x87 MFLOPS 801 1601 2402 2402 - - - -
32 bit MMX Int MIPS 4726 7116 8772 8734 - - - -
32 bit SSE2 Int MIPS 9490 13769 17545 17469 9490 14641 17527 17471
64 bit SSE2 Int MIPS 2402 4575 4586 4575 2402 4576 4585 4576
32 bit SSE MFLOPS 3202 6405 9608 9608 3202 6405 9609 9609
64 bit SSE2 MFLOPS 1601 3202 4804 4804 1601 3202 4804 4804
32 bit 3DNow MFLOPS - - - - - - - -
OpenMP Benchmark
OpenMP is a system independent set of procedures and software that arranges automatic parallel processing of shared memory data when more than one processor is provided. This option is available in the C/C++ compiler included in the Linux Ubuntu Distribution.
In each case, four benchmarks are provided, compiled with and without OpenMP options, to run on 32 bit and 64 bit systems.
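As an indication of what an OpenMP option involves, the sketch below applies two floating point operations per word to an array, with a directive asking the compiler to share the loop between the available processors. The calculation and the function name are assumptions made for illustration and are not copied from the benchmark source. With GCC, the directive takes effect when compiling with -fopenmp; without that option it is ignored and the loop runs on one CPU.

```c
#include <stddef.h>

/* Two floating point operations (one add, one multiply) per word.
   The pragma requests that loop iterations are divided between threads. */
void two_ops_per_word(float *x, size_t n, float a, float b)
{
    long i;
    long count = (long)n;

    #pragma omp parallel for
    for (i = 0; i < count; i++)
        x[i] = (x[i] + a) * b;
}
```

Each element is updated independently, so the result is the same whether or not OpenMP is enabled.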
The execution files and source code, along with compile and run instructions, can be downloaded in linux_openmp.tar.gz. Details and results are provided in linux_openmp benchmarks.htm and a summary follows.
Original OpenMP Benchmark
The original benchmark used larger data array sizes of 0.4, 4.0 and 40 MBytes with 2, 8 and 32 floating point calculations per word (4 Bytes). The 32 bit version behaved in a similar way to the Windows compilation, showing performance gains of a four core processor of up to four times that of a single CPU.
The 64 bit OpenMP version behaved in a similar manner to the 32 bit variation but appears to be relatively worse on comparing with speeds produced by the normal compilation. The reason is that the latter produces full SIMD operation, with four calculations per clock cycle, and the former SISD, with one calculation per clock (see above, where SIMD was not produced). Examples of results are given below.
Later results are for the 64 bit version running on the Core i7. In this case, for comparative purposes, those obtained by a multithreading version are also shown. This is MP MFLOPS - (see below). Also included is the non-OpenMP version and new compilations for SSE and with AVX functions.
The former produces the same speeds as MP MFLOPS using one thread, with maximum speed of around 24.5 GFLOPS for one thread, demonstrating SIMD, where the maximum possible is 31.2 GFLOPS [CPU GHz x 4 (register width) x 2 (linked multiply and add)]. Performance of 4 way MP MFLOPS speeds show appropriate gains, to produce up to 93.2 GFLOPS, but could require the use of the 8 threads available via Hyperthreading. The AVX benchmark shows suitable gains, with 8 word registers, where the maximum demonstrated is 177.8 GFLOPS.
Note that the i7 SSE OpenMP speeds, shown below, are from a version recompiled by GCC 4.8.2, as this produces SIMD instructions. The new versions are included in AVX_benchmarks.tar.gz, along with the new AVX benchmark. The original SSE version, in linux_openmp.tar.gz, produces SISD instructions and the maximum speeds shown underneath the i7 table.
The new compilations produce SIMD instructions for 2 and 8 operations per word, but performance is degraded due to data handling overheads. Then, at least, AVX scores are double those produced via SSE arithmetic. All the complex data handling seems to lead to SIMD instructions not being generated for the 32 operations tests, leading to SSE and AVX speeds being the same (single data word handling).
Linux OpenMP MFLOPS 3 GHz Quad Core Phenom
32 Bits 64 Bits
Data Ops/ 1 CPU 1 CPU 2 CPUs 4 CPUs 1 CPU 1 CPU 2 CPUs 4 CPUs
Words Word *Norm OMP OMP OMP *Norm OMP OMP OMP
100000 2 2439 1903 3575 5758 7624 1974 3597 5769
1000000 2 2231 1787 3588 6710 4686 1913 3843 6674
10000000 2 1739 1509 2490 3062 2195 1590 2566 2944
100000 8 3348 3518 6963 13353 14357 3437 6835 12126
1000000 8 3195 3453 6943 13524 13376 3375 6802 12420
10000000 8 3080 3308 6541 11311 7473 3219 6379 10976
100000 32 3881 3794 7566 14896 15336 3552 7084 13494
1000000 32 3853 3774 7554 14969 15009 3533 7079 13540
10000000 32 3817 3735 7465 14883 14318 3490 6970 13450
Instructions FPU FPU FPU FPU SIMD SISD SISD SISD
x87 x87 x87 x87 SSE SSE SSE SSE
*Norm OpenMP Directives not used - 1 CPU core SSE
Core i7 3.7 GHz at up to 3.9 GHz via Turbo Boost
----- MP MFLOPS 1 to 8 Threads ----- -------- OpenMP ---------
----- SSE ----- ----- AVX ----- SSE --- SSE --- --- AVX ---
M 4B Ops 1 4 8 1 4 8 1* aff1 8 aff1 8
Words Word
## ##
0.1 2 9681 45340 54621 12542 62273 60258 9918 6061 13742 10196 19577
1.0 2 9759 21688 41832 11404 23031 44329 9688 6215 19477 10025 37906
10.2 2 5990 9237 10026 5991 8970 9977 5870 5059 9137 5880 7782
0.1 8 24533 49320 92086 35982 159040 173224 24448 13220 44104 26481 88370
1.0 8 24570 49918 92352 36180 80096 151909 24465 13373 49499 27045 90579
10.2 8 19975 36638 39982 23299 40124 40153 20055 12719 38369 20593 35607
0.1 32 23269 46942 92408 46400 90572 173372 23251 5854 22858 5865 22845
1.0 32 23307 89676 93282 46572 91058 177831 23265 5863 23234 5870 23141
10.2 32 23052 91029 92050 44729 88877 158594 23063 5860 23127 5854 23077
2&8 Ops ------- SIMD ------ ------- SIMD ------ SIMD --- SIMD -- --- SIMD --
32 Ops ------- SIMD ------ ------- SIMD ------ SIMD --- SISD -- --- SISD --
## new version, Original SISD all cores - 2 Ops 3400, 8 Ops 6100, 32 Ops 5900
OpenMP MemSpeed Benchmark
MemSpeed benchmark employs three different sequences of operations, on 64 bit double precision floating point numbers, 32 bit single precision numbers and 32 bit integers, via two data arrays. It uses data volumes of 4 KBytes upwards to indicate performance via caches and RAM.
This version is a variation with evaluation mainly concentrating on the formula x[m] = x[m] + r * y[m].
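The formula can be expressed as a simple C loop over the two data arrays, as sketched below for double precision. The MB/second helper is an assumption for illustration: how the benchmark counts bytes per pass, and whether it uses 1,000,000 or 1,048,576 bytes per MByte, is not stated here. The names are hypothetical.

```c
#include <stddef.h>

/* One pass of x[m] = x[m] + r * y[m] over n words */
void memspeed_kernel(double *x, const double *y, double r, size_t n)
{
    for (size_t m = 0; m < n; m++)
        x[m] = x[m] + r * y[m];
}

/* Data transfer speed, assuming 1 MByte = 1,000,000 bytes and that
   bytes_per_pass is the data volume counted for one pass. */
double mbytes_per_second(size_t bytes_per_pass, long passes, double seconds)
{
    return (double)bytes_per_pass * (double)passes / 1.0e6 / seconds;
}
```

Timing many passes of memspeed_kernel at each data size, from 4 KBytes upwards, would give a table of the same general shape as the logs below.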
Below is a sample log file with the 64 bit benchmark using four CPUs. The extremely slow performance at the smaller data sizes is due to the relatively high startup overheads of OpenMP and, probably, cache flushing because shared data is being updated.
The 32 bit version produces even slower performance relative to the non-OpenMP compilation.
See also Multithreading version.
Selected results for the Core i7 include those for the benchmark, compiled without OpenMP directives, plus with and without OpenMP, produced by the later compiler that generates AVX instructions. The CPU has 4 cores plus Hyperthreading.
The non-OpenMP versions are compiled to use SIMD instructions, but performance is restricted due to overheads of loading, storing and inserting data. With these, AVX produced suitable gains for cache based data. SISD was generated by OpenMP compilations, leading to SSE and AVX speeds being the same. At least, many MP speeds were appropriately faster than those for single core tests, and maximum memory throughput was excellent. Further details are in OpenMP MFLOPS.htm.
The same program was also compiled using Pthread multithreading functions; see MP Memory Speed later.
Phenom II X4 3000 MHz OpenMP
Memory Reading Speed Test 64 Bit Version 1 by Roy Longbottom
Start of test Sun Dec 5 12:26:36 2010
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int64 Dble Sngl Int64 Dble Sngl Int64
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 2413 2340 2426 2408 2371 2593 1301 1302 1306
8 4642 4379 4655 4739 4488 5045 2562 2478 2583
16 8321 7942 8513 9215 8412 9668 4989 4695 4982
32 15714 12698 15446 16397 14036 17359 9112 7963 9159
64 25533 18268 24526 26971 21394 28979 16033 12269 16032
128 36147 23064 34023 40018 28460 42871 23255 16389 23172
256 45821 26908 42782 21679 34353 57114 31501 20370 31889
512 46924 28555 46191 55514 35557 54808 33583 22754 33376
1024 45478 28681 45098 48798 34662 47103 25081 22172 24993
2048 36642 26993 36187 36523 32366 36917 18354 17985 18388
4096 30960 24342 30259 32057 26483 32862 17172 15049 17153
8192 22963 20257 22754 23462 21376 23910 12203 11223 12176
16384 8927 8774 8888 8947 8803 8951 4469 4454 4487
32768 8938 8817 8875 8963 3681 8964 4494 4465 4488
65536 8956 8863 8910 8959 8849 8981 4500 4474 4502
131072 8979 8918 8951 8830 8808 9022 4513 4494 4517
262144 8784 8657 8706 8760 8826 8919 4436 4422 4433
524288 8774 8478 8789 8732 8643 8864 4374 3703 4435
1048576 8664 8559 8617 8689 8612 8678 4368 4360 4336
2097152 8661 8631 8643 8611 8597 8692 4364 4368 4367
Core i7 4820K 3900 MHz Turbo Boost - x[m]=x[m]+s*y[m] Int+
64 bit 64 bit OMP 64 bit AVX 64b AVX OMP 32 bit
KBytes Dble Sngl Dble Sngl Dble Sngl Dble Sngl Dble Sngl
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 39311 24057 2666 2628 60212 56633 2716 2670 34682 17663 L1
8 39076 24566 5058 4962 61736 58608 5163 5100 35342 17780
16 39851 24795 9662 9412 62061 59459 9818 9526 35555 17824
32 39859 24862 18780 17122 61951 59466 19317 17272 35391 17781
64 32844 24462 33953 26599 47441 40896 34221 26564 30900 17303 L2
128 32879 24498 51235 36875 46181 40101 52329 37762 31022 17313
256 30516 23886 70872 47353 41612 36928 71102 47183 29852 17324
512 25604 22420 90020 53395 31463 30294 90080 54397 24994 17127 L3
1024 25565 22368 97333 57510 30155 29099 97835 57372 24903 17129
2048 25589 22479 96621 58092 30044 29144 93511 58513 24909 17125
4096 25600 22405 87122 60230 30056 29218 93758 60141 24951 17194
8192 25593 22460 94138 60267 29891 29223 104996 59273 24864 17203
16384 15083 14415 27817 27128 15577 15790 27302 27169 14951 13705 RAM
32768 14845 14293 24666 24563 15191 15371 24620 24175 14890 13704
65536 14959 14424 24868 25137 15215 15401 24763 24725 14856 13695
131072 15041 14492 25625 25696 15230 15401 25636 25597 14880 13726
262144 15023 14491 25603 25435 15247 15410 25507 25348 14958 13773
524288 15053 14520 25603 25634 15204 15445 25646 25396 15016 13824
1048576 15085 14534 25569 25690 15198 15438 25160 25678 15025 13846
2097152 15096 14538 25634 25814 15254 15462 25656 25700
4194304 15096 14544 25344 25266 15252 15452 25413 25421
Max GFLOPS 5.0 6.2 12.2 15.1 7.8 14.9 13.1 15.0 4.4 4.5
BusSpeed Benchmark
This benchmark is particularly designed to identify reading data in bursts over buses, with a 32 bit version using 32 bit integer words and one for 64 bits using 64 bit numbers. The program starts by reading a word, with address increments of 32 words for the next data. The increment is reduced to 16 words then halving until all data is read. The last test reads all data but using SSE2 instructions.
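The reading pattern described above can be sketched in C as follows. The summing operation is only a placeholder so that each word read is used; the operation in the real benchmark may differ, and the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Read every `inc`-th word of the array and return the sum of the
   words read. inc = 32 corresponds to Inc32wds and inc = 1 to ReadAll. */
uint64_t read_with_increment(const uint64_t *data, size_t words, size_t inc)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < words; i += inc)
        sum += data[i];
    return sum;
}
```

Running this with inc set to 32, 16, 8, 4, 2 and 1 over the same array gives the sequence of tests, with each step reading twice as many words as the one before. When a whole burst is fetched for every word read, the time per word stays constant until the increment falls below the burst size.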
Below are 64 bit results on a Core i7, a Core 2 Duo, with sample results at 32 bits and both varieties on a Phenom processor. The data burst size over the memory bus is indicated at the point where performance becomes constant, like Inc8wds at 64 bits and Inc16wds at 32 bits, both suggesting 512 bits or 64 bytes. Burst reading speed is eight times the constant speed at 64 bits and 16 times at 32 bits, or around 6400 MB/second for the Core 2 Duo and 7200 for the Phenom. There also appears to be some burst reading from data in L2 cache.
Speeds via L1 cache are fairly constant up to ReadAll, indicating no burst reading but, with the data transfer speed at 32 bits being twice that for 64 bits, a constant instruction execution speed is suggested. This, in MIPS, is slightly less than CPU MHz for the Core 2 Duo and somewhat higher than MHz on the Phenom. The SSE2 test is identical at both bit versions with the Core 2 Duo showing better efficiency at nearly four 32 bit results (1 SSE register full) per CPU clock cycle.
Maximum speed of the Core i7, based on burst speed, is suggested to be around 18 GB/second, a long way from the 51.2 GB/second specification, but i7 Multithreading Benchmarks (below) are needed to approach this.
The 32 bit and 64 bit benchmarks, source code and instructions can be downloaded in memory_benchmarks.tar.gz, with more details and results in Linux Results BusSpeed.
Speed in MB/Second - For MIPS 64 bit divide by 8 and 32 bit divide by 4
Core i7 4820K 3900 MHz Turbo Boost - 1 CPU
Bus Speed Test 64 bit Version 2.0 Sat Nov 8 12:08:24 2014
Kbytes Inc32wds Inc16wds Inc8wds Inc4wds Inc2wds ReadAll 128bSSE2
6 31233 31271 31267 42205 38182 42586 61438 L1
24 31300 31277 31262 41632 39363 42724 62272
96 14511 15005 15180 24371 33172 40471 60769 L2
384 5367 5423 5502 10797 19594 33646 39043 L3
768 5280 5366 5435 10797 19322 33431 38081
1536 5247 5348 5493 10799 19399 33625 38234
16380 1282 1569 2170 4762 9130 18547 19124 RAM
131070 1223 1484 2098 4543 8731 18096 18349
393210 1223 1484 2098 4542 8733 18095 18344
Bus Speed Test 32 bit Version 2.0 - L1 cache, L2 cache and RAM
6 15308 15463 20502 18262 20300 21300 60627
96 7434 7593 11491 16540 20013 21082 60633
1536 2677 2757 5381 9694 16801 21026 38206
393210 742 1048 2245 4360 9063 16342 18263
Core 2 Duo 2400 MHz - 1 CPU
Bus Speed Test 64 bit Version 2.0 Thu Dec 16 23:09:19 2010
Kbytes Inc32wds Inc16wds Inc8wds Inc4wds Inc2wds ReadAll 128bSSE2
6 15997 17525 18167 18540 18734 18804 37355
24 17759 18484 17865 17822 18531 18526 37980
96 4189 4158 4107 6724 9128 13435 19175
384 4182 4137 4091 6721 9133 13450 19206
768 4109 4123 4094 6723 9129 13448 19229
1536 3883 4086 4039 6643 9011 13280 18913
16380 657 691 800 1626 2949 5445 5882
131070 693 711 803 1622 2942 5440 5874
393210 698 713 803 1623 2948 5444 5865
Bus Speed Test 32 bit Version 2.0 - L1 cache, L2 cache and RAM
6 8568 9076 9176 9315 9412 9433 37350
96 2112 2053 3277 4561 6714 8097 19170
393210 356 401 815 1474 2730 5091 5870
Phenom II X4 3000 MHz - 1 CPU
Bus Speed Test 64 bit Version 2.0 - L1 cache, L2 cache and RAM
6 21407 22690 26285 27053 27050 26435 23784
96 2992 2973 2991 5992 11780 20725 23813
393210 869 901 918 1791 3729 6264 7391
Bus Speed Test 32 bit Version 2.0 - L1 cache, L2 cache and RAM
6 11287 12793 13466 13625 13407 13281 23648
96 1494 1490 2974 5854 10509 13147 23781
393210 447 453 901 1830 3097 5206 7276
RandMem Benchmark
RandMem benchmark carries out eight tests at increasing data sizes to produce data transfer speeds in MBytes Per Second from caches and memory. Serial and random address selections are employed, using the same program structure, with read and read/write tests for 32 bit integers and 64 bit floating point numbers. In both cases, 32 bit integers are used.
The main purpose is to demonstrate how much slower performance can be through using random access. Here, speed can be considerably influenced by reading and writing in bursts, where much of the data is redundant, and by the size of preceding caches.
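One way of using the same program structure for serial and random access is to read data through an array of 32 bit integer indices, which is either left in order or shuffled. The sketch below assumes this approach for illustration; the function names and the simple pseudo-random shuffle are not taken from the benchmark source.

```c
#include <stddef.h>
#include <stdint.h>

/* Serial address selection: idx[i] = i */
void fill_serial(uint32_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        idx[i] = (uint32_t)i;
}

/* Random address selection: shuffle the indices in place using a
   Fisher-Yates shuffle driven by a simple linear congruential generator. */
void shuffle_indices(uint32_t *idx, size_t n, uint32_t seed)
{
    uint32_t state = seed;
    for (size_t i = n; i > 1; i--) {
        state = state * 1664525u + 1013904223u;
        size_t j = (size_t)(state >> 8) % i;   /* 0 <= j < i */
        uint32_t tmp = idx[i - 1];
        idx[i - 1] = idx[j];
        idx[j] = tmp;
    }
}

/* The same read loop serves both tests; only the index order differs. */
uint64_t read_via_index(const uint32_t *data, const uint32_t *idx, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[idx[i]];
    return sum;
}
```

Both orders read every word once and produce the same total, so any difference in running time comes from the access pattern, not from the amount of work.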
Below, all 64 bit results are shown for a Phenom along with sample speeds at 32 bits and for a Core 2 Duo at 64 bits. Many of the low order speeds are similar at 32 bits and 64 bits but, using RAM, some relationships change, with integer random access becoming progressively worse at 64 bits. The lower GHz Core 2 Duo performs better on some tests.
Later results are for the Core i7, which is much faster than the earlier systems, particularly relative to CPU clock speed.
The 32 bit and 64 bit benchmarks, source code and instructions can be downloaded in memory_benchmarks.tar.gz, with more details and results in Linux Results RandMem.
Core i7 4820K 3.9 GHz Turbo Boost - 1 CPU
Random/Serial Memory Test 64 Bit Version 2 Sat Nov 8 12:10:51 2014
Integer....................... Double/Integer................
Serial........ Random........ Serial........ Random........
RAM Read Rd/Wrt Read Rd/Wrt Read Rd/Wrt Read Rd/Wrt
KB MB/Sec MB/Sec MB/Sec MB/Sec MB/Sec MB/Sec MB/Sec MB/Sec
6 26914 28379 26521 25259 30506 43477 30389 43791 L1
12 26984 28876 26341 28078 29905 43462 29909 43020
24 27062 29098 26526 28219 29865 43649 29832 42931
48 23161 23723 18749 12718 29702 33997 29670 30451 L2
96 23203 23731 13790 8816 29766 33586 22909 14830
192 23378 23626 11539 7634 29685 32647 18371 12232
384 22366 18631 8073 5883 27876 24687 14813 10078 L3
768 22290 18024 6043 4978 27801 23322 10159 8041
1536 22305 18023 5407 4576 27801 23316 8801 7311
3072 22449 18119 5170 4374 27443 23151 8202 6887
6144 22392 18111 5040 4269 27867 23187 7970 6683
12288 15007 11910 2499 2698 20487 16022 4276 4837 RAM
24576 13928 11206 1332 1336 17949 13729 2324 2389
49152 13987 11299 1068 1061 17771 13626 1750 1774
98304 14041 11331 971 864 18568 13699 1586 1558
196608 14031 11379 927 685 18627 13752 1491 1175
393216 14044 11397 908 623 18637 13741 1450 992
786432 14037 11373 898 603 18579 13650 1430 935
1572864 13844 11407 890 614 18624 13720 1418 924
At 32 bits
6 24759 28651 24162 27110 30309 42529 30315 42969
96 22385 23808 13417 8855 29721 34194 23310 14622
1536 21480 18032 5369 4573 26884 23312 8845 7302
393216 13743 11378 906 693 18574 13708 1450 1097
786432 13809 11398 896 670 18578 13753 1430 1033
AMD Phenom(tm) II X4 945 Processor 3.0 GHz
Random/Serial Memory Test 64 Bit Version 2 Tue Dec 14 17:21:46 2010
Integer....................... Double/Integer................
Serial........ Random........ Serial........ Random........
RAM Read Rd/Wrt Read Rd/Wrt Read Rd/Wrt Read Rd/Wrt
KB MB/Sec MB/Sec MB/Sec MB/Sec MB/Sec MB/Sec MB/Sec MB/Sec
6 12542 9137 12636 9066 16812 13621 16795 13621
12 12613 9165 12676 9137 17022 13705 17013 13673
24 12647 9179 12734 9157 17129 13720 17130 13694
48 12664 9186 12775 9161 17183 13728 17183 13719
96 11989 8464 6866 5221 16934 11776 16496 11888
192 7778 8434 3703 3177 16902 11747 7146 6132
384 7778 8437 3001 2749 16918 11671 5116 4730
768 4956 7348 1954 1900 9978 9459 3670 3591
1536 4763 7201 1404 1388 9748 9346 2488 2474
3072 4016 6914 1078 1045 9531 9200 2048 2043
6144 3668 6769 750 661 9004 8719 1405 1280
12288 2771 3636 590 502 6688 5495 1012 848
24576 2850 3592 504 450 6706 5506 841 736
49152 2858 3583 439 402 6719 5332 727 659
98304 2679 3536 333 307 6697 5490 612 564
196608 2729 3548 266 241 6945 5445 459 422
393216 2866 3559 229 200 6931 5490 377 336
786432 2870 3547 192 167 6938 5499 327 283
At 32 bits
6 14488 11399 12852 11133 16741 20258 16789 19825
96 11088 9912 6861 5520 16960 16197 16554 14645
1536 8044 7528 1410 1390 9668 9223 2475 2461
393216 4296 3575 281 258 6668 5497 491 458
786432 4296 3562 238 212 6841 5492 396 361
Intel Core 2 CPU 6600 @ 2.40GHz
At 64 bits
6 9142 12213 9154 5161 13728 16211 13727 15654
96 8019 9473 4113 3701 11381 11971 7382 6419
1536 7978 8586 2691 2497 11269 11044 4760 4222
393216 3285 2273 238 207 5705 2999 503 374
786432 3297 2277 149 152 5637 3001 297 281
SSEfpu Benchmark
This is a variation of the SSE3DNow Benchmark with extensions but excluding AMD 3DNow tests. The benchmark measures Single Precision (SP) and Double Precision (DP) Floating Point speeds, data streaming from caches and RAM. It uses SSE (SP) and SSE2 (DP) assembly code instructions, along with compiled C code that produces the old x87 instructions at 32 bits and SSE type for working on a 64 bit system.
The additional tests avoid intermediate register to register operations using s=(s+x[m])*y[m] and s=s+x[m]+y[m] to produce much faster speeds.
The AMD processor performs relatively better on the extra test, with linked add and multiply, at 7.11 floating point results per clock cycle on the Phenom. Then, the Core i7 regains the lost ground and also obtains a high throughput on RAM based data.
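As a rough illustration of what the test loops calculate, here is a plain C sketch of the four operation sequences named in the results headings. The benchmark itself uses SSE/SSE2 assembly code, so this scalar version only shows the arithmetic, not the measured instruction streams. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* s = s + x[m] * y[m] : multiply then add, two operations per x,y pair */
float sum_mul_add(const float *x, const float *y, size_t n)
{
    float s = 0.0f;
    assert(x != NULL && y != NULL);
    for (size_t m = 0; m < n; m++)
        s = s + x[m] * y[m];
    return s;
}

/* x[m] = x[m] + y[m] : one add per x,y pair, result written back to x */
void add_in_place(float *x, const float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    for (size_t m = 0; m < n; m++)
        x[m] = x[m] + y[m];
}

/* s = (s + x[m]) * y[m] : linked add and multiply (the +*SSE column) */
float linked_add_mul(const float *x, const float *y, size_t n)
{
    float s = 0.0f;
    assert(x != NULL && y != NULL);
    for (size_t m = 0; m < n; m++)
        s = (s + x[m]) * y[m];
    return s;
}

/* s = s + x[m] + y[m] : two adds per x,y pair (the ++SSE column) */
float linked_add_add(const float *x, const float *y, size_t n)
{
    float s = 0.0f;
    assert(x != NULL && y != NULL);
    for (size_t m = 0; m < n; m++)
        s = s + x[m] + y[m];
    return s;
}
```

In the last two loops each loaded value feeds directly into the running result, which is the property the text credits for the higher speeds.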
The 32 bit and 64 bit benchmarks, source code and instructions can be downloaded in
memory_benchmarks.tar.gz
with more details and results in
Linux Results SSEfpu.
Core i7 4820K 3.9 GHz Turbo Boost - 1 CPU
SSE & SSE2 Memory Reading Speed Test 64-Bit Version 2.1
0.100 seconds per test, Start Tue Dec 2 17:46:19 2014
Memory --s=s+x[m]*y[m]--- --x[m]=x[m]+y[m]-- (s+x[m])?y[m]
KBytes SSE2 SSE Sngl SSE2 SSE Sngl +*SSE ++SSE
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 40997 41051 10763 78459 75013 28446 87877 59793 L1
8 41168 41321 10588 78338 78301 27627 96326 60640
16 41366 41444 10505 80368 80739 27706 98675 60593
32 41423 41511 10462 80669 81160 27759 92764 60609
64 41432 41427 10445 50083 50209 27169 57689 57389 L2
128 41447 41508 10412 49595 49500 27192 55731 56598
256 39673 39746 10414 46176 46119 26167 48386 48563
512 37246 37301 10417 32252 32250 24247 39595 39688 L3
1024 36639 36601 10425 31307 31197 24044 38688 38794
2048 36640 36824 10421 31262 31328 24138 38804 38750
4096 36900 36899 10393 31379 31381 24227 38739 38942
8192 36585 36615 10403 31076 31063 24076 38442 38534
16384 23186 23097 9577 15371 15292 16067 22518 22562 RAM
32768 22592 22574 9573 14973 15013 15743 21935 22058
65536 22603 22504 9596 15041 14972 15718 22061 22052
131072 22612 22612 9582 15038 15030 15672 22096 22003
262144 22629 22610 9584 15049 15044 15698 22040 22109
524288 22638 22654 9592 15057 15056 15682 22101 22101
1048576 22618 22481 9598 15038 15049 15605 22110 22104
2097152 22671 22648 9608 15050 15051 15546 22094 22129
4194304 22671 22668 9597 15044 15056 15691 22112 22128
SSE2 SSE Norm SSE2 SSE Norm SSE SSE
Maximum DP SP SP DP SP SP SP SP
MFLOPS 5181 10378 2691 5042 10145 3556 24669 15160
MFLOPS/MHz 1.33 2.66 0.69 1.29 2.60 0.91 6.33 3.89
MB/sec at 32 bits
8 41382 41382 10592 79081 78697 20892 92411 61511
128 41604 41586 10436 49128 49126 18239 55914 56067
1024 36098 35957 10425 31113 31127 16998 38204 38336
131072 21010 20979 10092 14783 14774 12497 20655 20626
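The Maximum MFLOPS rows above appear to follow from the highest MB/S figure in each column. The sketch below makes two assumptions that are inferred from the numbers rather than stated by the benchmark: each calculation reads one word from x and one from y, and MFLOPS/MHz uses the 3900 MHz Turbo Boost clock. Function names are illustrative only.

```c
#include <assert.h>

/* Convert a measured data rate to MFLOPS.
   mb_per_sec   : measured MB/second
   word_bytes   : 4 for single precision, 8 for double precision
   ops_per_pair : floating point operations per x[m],y[m] pair
   Assumption: every calculation reads one x word and one y word. */
double mflops_from_mbps(double mb_per_sec, int word_bytes, int ops_per_pair)
{
    assert(word_bytes > 0 && ops_per_pair > 0);
    double mpairs_per_sec = mb_per_sec / (2.0 * word_bytes);
    return mpairs_per_sec * ops_per_pair;
}

/* Floating point results per clock cycle */
double mflops_per_mhz(double mflops, double cpu_mhz)
{
    assert(cpu_mhz > 0.0);
    return mflops / cpu_mhz;
}
```

For the +*SSE column, 98675 MB/S with 4 byte words and 2 operations per pair gives about 24669 MFLOPS, or 6.33 per MHz at 3900 MHz, in line with the table.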
AMD Phenom(tm) II X4 945 Processor 3.0 GHz
SSE & SSE2 Memory Reading Speed Test 64-Bit Version 2.0
0.100 seconds per test, Start Tue Dec 21 12:18:05 2010
Memory --s=s+x[m]*y[m]--- --x[m]=x[m]+y[m]-- (s+x[m])?y[m]
KBytes SSE2 SSE Sngl SSE2 SSE Sngl +*SSE ++SSE
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 22773 22689 6156 43460 42950 23333 66361 41700
8 23421 23377 6089 45716 45433 23624 78620 44642
16 23623 23691 6059 42561 42562 23724 84534 45885
32 23834 23827 6043 45141 45140 23797 82980 46315
64 23921 23918 6035 44686 45478 23823 85405 46897
128 23859 23901 6029 22154 22157 17973 23785 23782
256 23821 23764 6027 21555 21535 18026 23888 23889
512 19300 19264 6010 17865 17840 16359 19219 19222
1024 10376 10379 5965 10168 10168 10228 10371 10373
2048 10369 10372 5966 10163 10163 10236 10369 10368
4096 10261 10281 5862 9975 9975 10025 10278 10278
8192 8053 8190 5362 6841 6836 6863 8029 8027
16384 7985 8095 5327 6572 6569 6651 7848 7883
32768 8074 8099 5314 6424 6531 6660 7858 7928
65536 8148 8151 5321 6599 6607 6674 7961 7961
131072 8092 8159 5320 6585 6412 6484 7891 7936
262144 8112 8173 5318 6580 6556 6665 7887 7960
524288 8117 8042 5327 6607 6604 6689 7861 7961
1048576 8147 8108 5328 6535 6581 6668 7941 7816
SSE2 SSE Norm SSE2 SSE Norm SSE SSE
Maximum DP SP SP DP SP SP SP SP
MFLOPS 2990 5980 1539 2857 5685 2978 21351 11724
MFLOPS/MHz 0.99 1.99 0.51 0.95 1.95 0.99 7.11 3.90
MB/sec at 32 bits
Different #####
8 23188 23276 6057 45641 43156 11688 78703 44729
128 23634 23692 5997 22418 22250 9893 23671 23664
1024 10248 10254 5930 10056 10053 8682 10253 10253
131072 8258 8276 5389 6680 6698 6098 7909 8091
Intel Core 2 CPU 6600 @ 2.40GHz
At 64 bits
Different ##### ##### #####
8 25420 25368 6506 37691 37692 13152 36503 36637
128 18481 18655 6406 17105 17107 12704 19725 19744
1024 18517 18749 6391 17136 17137 12690 19803 19822
131072 6444 6419 5455 3955 3956 3863 6399 6393
Maximum
MFLOPS/MHz 1.32 2.64 0.68 0.98 1.96 0.68 3.80 3.81
nVidia CUDA Benchmarks and Burn-in Tests
CUDA, from nVidia, provides programming functions to use GeForce graphics processors for general purpose computing. These functions are easy to use in executing arithmetic instructions on numerous processing elements simultaneously. This is for Single Instruction Multiple Data (SIMD) operation, where the same instructions can be executed simultaneously on sections of data from a data array. For maximum speeds, the data array has to be large and with little or no references to graphics or host CPU RAM. To assist in this, CUDA hardware provides a large number of registers and high speed cache like memory.
The benchmarks measure floating point speeds in Millions of Floating Point Operations Per Second (MFLOPS). They demonstrate some best and worst case performance, using varying data array sizes and increasing numbers of processing instructions per data access. There are five scenarios: New Calculations with data in and out, Update Data with just data out, Graphics Only Data using only graphics RAM, and two extra tests with lower overheads.
The tests are run at three different data sizes, defaults 100,000 words repeated 2500 times, 1M words 250 times and 10M words 25 times. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 adds or subtracts and multiplies on each data element. The Extra Tests are only run using 10M words repeated 25 times.
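The arithmetic and the MFLOPS calculation can be sketched in plain C. This is a CPU side illustration only, not the CUDA kernel used by the benchmark; the function names are hypothetical and the constants a to f are whatever the caller supplies.

```c
#include <assert.h>
#include <stddef.h>

/* 8 operations per word: three adds, three multiplies, one subtract
   and one further add, x = (x + a) * b - (x + c) * d + (x + e) * f */
void calc8(float *x, size_t n,
           float a, float b, float c, float d, float e, float f)
{
    assert(x != NULL);
    for (size_t i = 0; i < n; i++)
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
}

/* MFLOPS = words x operations per word x repeat passes / seconds / 1e6 */
double mflops(double words, double ops_per_word, double passes, double seconds)
{
    assert(seconds > 0.0);
    return words * ops_per_word * passes / seconds / 1.0e6;
}
```

As a check against the first GTS 250 result line, 100000 words x 2 operations x 2500 passes in 1.035893 seconds works out at about 483 MFLOPS.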
The benchmarks, source code and instructions can be downloaded in
linux_cuda_mflops.tar.gz
with more details and results in
linux_cuda_mflops.htm,
the latter showing how to use the benchmarks as reliability-burn-in tests.
Added 2014 results are for a mid range GeForce GTX 650, with a 3.7 GHz Core i7, via Windows 8.1 and Ubuntu 14.04. A maximum of 412 GFLOPS was demonstrated, making it more than twice as fast as a more expensive GTS 250 from three years earlier. The i7 Asus P9X79 LE motherboard has PCI Express 3.0 x16 which, along with faster RAM and CPU GHz, produces the fastest speeds so far where data in and out, or out only, is involved. Earlier systems probably had PCIe 1, with a maximum bandwidth of 4 GB/s, or PCIe 2 at 8 GB/s, compared with 15.74 GB/s for PCIe 3. Below are full results for the GTS 250 and the GTX 650.
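The PCIe bandwidth figures quoted can be reproduced, approximately, from the per lane transfer rate and line encoding of each generation. The sketch below assumes a 16 lane slot and the standard rates of 2.5, 5 and 8 GT/s with 8b/10b, 8b/10b and 128b/130b encoding; the function name is hypothetical.

```c
#include <assert.h>

/* Usable GB/second in one direction for a PCIe link.
   gt_per_sec  : raw transfer rate per lane in GT/s
   enc_payload : payload bits per encoded group (8 or 128)
   enc_total   : total bits per encoded group (10 or 130)
   lanes       : link width, 16 for an x16 slot */
double pcie_gbytes_per_sec(double gt_per_sec, int enc_payload,
                           int enc_total, int lanes)
{
    assert(enc_payload > 0 && enc_total > 0 && lanes > 0);
    return gt_per_sec * enc_payload / enc_total * lanes / 8.0;
}
```

With these inputs the results are 4 GB/s for PCIe 1, 8 GB/s for PCIe 2 and roughly 15.75 GB/s for PCIe 3, close to the figures in the text. Measured data in and out speeds will be lower because of protocol and software overheads.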
Phenom II 3.0 GHz GeForce GTS 250
Linux CUDA 3.2 x64 32 Bits SP MFLOPS Benchmark 1.4 Wed Dec 29 15:35:35 2010
CUDA devices found
Device 0: GeForce GTS 250 with 16 Processors 128 cores
Global Memory 999 MB, Shared Memory/Block 16384 B, Max Threads/Block 512
Using 256 Threads
Test 4 Byte Ops Repeat Seconds MFLOPS First All
Words /Wd Passes Results Same
Data in & out 100000 2 2500 1.035893 483 0.9295383095741 Yes
Data out only 100000 2 2500 0.514445 972 0.9295383095741 Yes
Calculate only 100000 2 2500 0.082464 6063 0.9295383095741 Yes
Data in & out 1000000 2 250 0.706176 708 0.9925497770309 Yes
Data out only 1000000 2 250 0.380928 1313 0.9925497770309 Yes
Calculate only 1000000 2 250 0.051266 9753 0.9925497770309 Yes
Data in & out 10000000 2 25 0.639933 781 0.9992496371269 Yes
Data out only 10000000 2 25 0.339051 1475 0.9992496371269 Yes
Calculate only 10000000 2 25 0.041672 11999 0.9992496371269 Yes
Data in & out 100000 8 2500 1.013196 1974 0.9569796919823 Yes
Data out only 100000 8 2500 0.490317 4079 0.9569796919823 Yes
Calculate only 100000 8 2500 0.088028 22720 0.9569796919823 Yes
Data in & out 1000000 8 250 0.666709 3000 0.9955092668533 Yes
Data out only 1000000 8 250 0.351320 5693 0.9955092668533 Yes
Calculate only 1000000 8 250 0.052704 37948 0.9955092668533 Yes
Data in & out 10000000 8 25 0.620265 3224 0.9995486140251 Yes
Data out only 10000000 8 25 0.335467 5962 0.9995486140251 Yes
Calculate only 10000000 8 25 0.044453 44992 0.9995486140251 Yes
Data in & out 100000 32 2500 1.057142 7568 0.8900792598724 Yes
Data out only 100000 32 2500 0.531691 15046 0.8900792598724 Yes
Calculate only 100000 32 2500 0.128706 62157 0.8900792598724 Yes
Data in & out 1000000 32 250 0.688714 11616 0.9880728721619 Yes
Data out only 1000000 32 250 0.375411 21310 0.9880728721619 Yes
Calculate only 1000000 32 250 0.075172 106423 0.9880728721619 Yes
Data in & out 10000000 32 25 0.644074 12421 0.9987990260124 Yes
Data out only 10000000 32 25 0.357000 22409 0.9987990260124 Yes
Calculate only 10000000 32 25 0.062001 129029 0.9987990260124 Yes
Extra tests - loop in main CUDA Function
Calculate 10000000 2 25 0.050288 9943 0.9992496371269 Yes
Shared Memory 10000000 2 25 0.009206 54313 0.9992496371269 Yes
Calculate 10000000 8 25 0.049608 40316 0.9995486140251 Yes
Shared Memory 10000000 8 25 0.017254 115916 0.9995486140251 Yes
Calculate 10000000 32 25 0.050531 158320 0.9987990260124 Yes
Shared Memory 10000000 32 25 0.046626 171580 0.9987990260124 Yes
##############################################################################
Core i7 4820K 3.9 GHz Turbo Boost GeForce GTX 650
Linux CUDA 3.2 x64 32 Bits SP MFLOPS Benchmark 1.4 Tue Dec 30 22:50:52 2014
CUDA devices found
Device 0: GeForce GTX 650 with 2 Processors 16 cores
Global Memory 999 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024
Using 256 Threads
Test 4 Byte Ops Repeat Seconds MFLOPS First All
Words /Wd Passes Results Same
Data in & out 100000 2 2500 0.837552 597 0.9295383095741 Yes
Data out only 100000 2 2500 0.389646 1283 0.9295383095741 Yes
Calculate only 100000 2 2500 0.085709 5834 0.9295383095741 Yes
Data in & out 1000000 2 250 0.441478 1133 0.9925497770309 Yes
Data out only 1000000 2 250 0.229017 2183 0.9925497770309 Yes
Calculate only 1000000 2 250 0.051727 9666 0.9925497770309 Yes
Data in & out 10000000 2 25 0.369060 1355 0.9992496371269 Yes
Data out only 10000000 2 25 0.201172 2485 0.9992496371269 Yes
Calculate only 10000000 2 25 0.048027 10411 0.9992496371269 Yes
Data in & out 100000 8 2500 0.708377 2823 0.9571172595024 Yes
Data out only 100000 8 2500 0.388206 5152 0.9571172595024 Yes
Calculate only 100000 8 2500 0.092254 21679 0.9571172595024 Yes
Data in & out 1000000 8 250 0.478644 4178 0.9955183267593 Yes
Data out only 1000000 8 250 0.231182 8651 0.9955183267593 Yes
Calculate only 1000000 8 250 0.053854 37138 0.9955183267593 Yes
Data in & out 10000000 8 25 0.370669 5396 0.9995489120483 Yes
Data out only 10000000 8 25 0.202392 9882 0.9995489120483 Yes
Calculate only 10000000 8 25 0.049263 40599 0.9995489120483 Yes
Data in & out 100000 32 2500 0.725027 11034 0.8902152180672 Yes
Data out only 100000 32 2500 0.407579 19628 0.8902152180672 Yes
Calculate only 100000 32 2500 0.113188 70679 0.8902152180672 Yes
Data in & out 1000000 32 250 0.497855 16069 0.9880878329277 Yes
Data out only 1000000 32 250 0.261461 30597 0.9880878329277 Yes
Calculate only 1000000 32 250 0.060132 133042 0.9880878329277 Yes
Data in & out 10000000 32 25 0.375882 21283 0.9987964630127 Yes
Data out only 10000000 32 25 0.207640 38528 0.9987964630127 Yes
Calculate only 10000000 32 25 0.054718 146204 0.9987964630127 Yes
Extra tests - loop in main CUDA Function
Calculate 10000000 2 25 0.018107 27613 0.9992496371269 Yes
Shared Memory 10000000 2 25 0.007775 64308 0.9992496371269 Yes
Calculate 10000000 8 25 0.025103 79671 0.9995489120483 Yes
Shared Memory 10000000 8 25 0.008724 229241 0.9995489120483 Yes
Calculate 10000000 32 25 0.036397 219797 0.9987964630127 Yes
Shared Memory 10000000 32 25 0.019414 412070 0.9987964630127 Yes
Disk, Bus and LAN Benchmarks
These benchmark tests are based on those produced for Windows, where details and results can be found in
DiskGraf Results.htm and
CDDVDSpd Results.htm.
The tests comprise:
- Writing and Reading Large Files - Five files each of 8 MB, 16 MB and 32 MB are used.
System is instructed not to cache the data.
- Writing and Reading Cached Data - Five files of 8 MB are used. Performance normally
reflects memory speed.
- Reading Bus Speed - The same data is read repetitively at block sizes between 64 KB and
1 MB. This normally reads data from the disk’s buffer to show maximum bus speeds.
- Random Reading Speed - 1 KB blocks are read randomly from 7 file sizes between 2 MB
and 128 MB. Results reflect the disk's buffer size and rotation speed.
- Writing and Reading Small Files - 500 files are written, read and deleted at 6 different
file sizes each between 2 KB and 64 KB. Besides speed, milliseconds per file is provided to reflect overheads.
- Run time parameters - These are provided to write and read larger files and to specify
the drive and file path to be used.
Besides testing disk and flash memory drives, it was intended to use the (drivespeed) benchmarks for measuring speed over connections such as Local Area Networks (LANs). In order to avoid data being cached in main memory by the Operating System, the program uses direct I/O (file open parameter O_DIRECT for Linux). This prevented use with directories mounted over a LAN, so a second program (lanspeed) was produced, identical except with no direct I/O parameter. Compilations at both 32 bits and 64 bits were produced - drivespeed32, lanspeed32, drivespeed64 and lanspeed64.
The lanspeed tests can be used to measure speeds between Linux platforms and also between Linux and Windows systems. A Windows program, drivespeed32.exe, is also provided and this can also be used as a LAN speed test.
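The difference between the two programs can be pictured with a minimal C sketch of opening a file with and without O_DIRECT. This is not the benchmark source; the function name, block size and 4096 byte alignment are assumptions chosen for illustration.

```c
#define _GNU_SOURCE          /* exposes O_DIRECT in <fcntl.h> on Linux */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0           /* not available here: falls back to cached I/O */
#endif

#define BLOCK_BYTES 65536    /* a multiple of common sector sizes */

/* Write one aligned block to path. With use_direct non-zero the file is
   opened with O_DIRECT (the drivespeed approach), otherwise through the
   normal cached path (the lanspeed approach).
   Returns bytes written, or -1 on failure. */
long write_block(const char *path, int use_direct)
{
    void *buf = NULL;
    int flags = O_WRONLY | O_CREAT | O_TRUNC;

    assert(path != NULL);
    if (use_direct)
        flags |= O_DIRECT;

    /* direct I/O generally needs an aligned buffer, size and offset */
    if (posix_memalign(&buf, 4096, BLOCK_BYTES) != 0)
        return -1;
    memset(buf, 0xA5, BLOCK_BYTES);

    int fd = open(path, flags, 0644);
    if (fd < 0) {            /* e.g. the file system rejects O_DIRECT */
        free(buf);
        return -1;
    }
    ssize_t n = write(fd, buf, BLOCK_BYTES);
    close(fd);
    free(buf);
    return (long)n;
}
```

A caller could try write_block(path, 1) first and fall back to write_block(path, 0) if it returns -1, which mirrors having separate direct and non-direct programs.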
The execution files, source code along with compiling and running instructions, can be downloaded in
linux_disk_usb_lan_benchmarks.tar.gz
with
linux_disk_usb_lan_benchmarks.htm
providing details and results. Example results are below.
The latest version has an added test to measure Random Writing Speed.
The second set of results below is from 2014, on the 3.7 GHz Core i7 via Ubuntu 14.04, using a Seagate Expansion USB 3.0 disk drive. Further details and comparisons with a number of Flash Drives are in
the results report.
Current Directory Path:
/media/f816ec76-8bf2-4dd3-9e98-62934909a779/roy/all64/drivespeed2
Total MB 11263, Free MB 9513, Used MB 1750
Linux Storage Speed Test 64-Bit Version 1.1, Tue Feb 1 14:20:39 2011
8 MB File 1 2 3 4 5
Writing MB/sec 4.33 76.73 76.15 82.40 105.84
Reading MB/sec 57.37 86.62 83.40 80.74 82.34
16 MB File 1 2 3 4 5
Writing MB/sec 73.94 108.16 72.53 116.19 116.12
Reading MB/sec 70.39 103.31 120.31 121.53 121.48
32 MB File 1 2 3 4 5
Writing MB/sec 113.01 76.67 73.20 115.83 116.05
Reading MB/sec 105.19 102.41 113.15 121.55 120.59
---------------------------------------------------------------------
8 MB Cached File 1 2 3 4 5
Writing MB/sec 1271.71 1503.73 1496.38 1493.27 1491.68
Reading MB/sec 3406.70 4015.11 4079.82 4081.24 4080.77
---------------------------------------------------------------------
Bus Speed Block KB 64 128 256 512 1024
Reading MB/sec 84.93 102.31 112.31 121.03 116.41
---------------------------------------------------------------------
1 KB Reads File MB > 2 4 8 16 32 64 128
Random Read msecs 0.43 0.39 0.45 3.01 4.49 5.93 6.69
---------------------------------------------------------------------
500 Files Write Read Delete
File KB MB/sec ms/File MB/sec ms/File Seconds
2 7.54 0.27 7.67 0.27 0.015
4 17.19 0.24 22.27 0.18 0.018
8 20.24 0.40 27.21 0.30 0.017
16 33.27 0.49 47.16 0.35 0.019
32 52.67 0.62 67.20 0.49 0.016
64 55.43 1.18 75.49 0.87 0.015
######################################################################
3.7 GHz Core i7, Seagate Expansion USB 3.0 Disk Drive
Current Directory Path:
/home/roy/benchmarks/Old/drivespeed
Total MB 446040, Free MB 435358, Used MB 10681
Linux Storage Speed Test 64-Bit Version 1.2, Sun Dec 28 11:36:15 2014
8 MB File 1 2 3 4 5
Writing MB/sec 165.25 70.00 29.78 26.55 41.54
Reading MB/sec 28.61 68.77 74.49 89.81 148.71
16 MB File 1 2 3 4 5
Writing MB/sec 94.83 105.93 90.70 101.86 88.25
Reading MB/sec 70.23 90.52 84.74 43.40 98.24
32 MB File 1 2 3 4 5
Writing MB/sec 118.93 102.33 95.05 94.94 105.92
Reading MB/sec 85.99 102.28 99.45 104.34 112.30
---------------------------------------------------------------------
8 MB Cached File 1 2 3 4 5
Writing MB/sec 2388.78 2453.24 2468.73 2351.90 2472.20
Reading MB/sec 7077.93 8329.63 8966.46 8957.32 8925.51
---------------------------------------------------------------------
Bus Speed Block KB 64 128 256 512 1024
Reading MB/sec 165.98 146.73 177.92 197.84 202.40
---------------------------------------------------------------------
1 KB Blocks File MB > 2 4 8 16 32 64 128
Random Read msecs 0.17 0.15 0.17 2.33 6.44 6.90 8.04
Random Write msecs 0.12 0.19 0.14 1.70 13.34 2.39 8.61
---------------------------------------------------------------------
500 Files Write Read Delete
File KB MB/sec ms/File MB/sec ms/File Seconds
2 7.48 0.27 12.00 0.17 0.004
4 25.33 0.16 29.68 0.14 0.004
8 48.45 0.17 32.84 0.25 0.008
16 73.08 0.22 37.87 0.43 0.004
32 80.54 0.41 55.88 0.59 0.004
64 107.98 0.61 82.93 0.79 0.009
Burn-In and Reliability Testing Apps
A new set of programs has been designed for soak testing Linux based PCs. The execution files and source code, along with compile and run instructions, can be downloaded in
linux_burn-in_apps.tar.gz.
Full details and results are provided in
linux burn-in apps.htm.
These programs are intended to stress test CPUs, caches, RAM, buses, disks and other drives using high processing speeds, to induce heating effects, and varying data bit order, to investigate possible pattern conscious faults. Common features are command line options to specify memory/storage demands, running time and different results log file names, for use in multiprocessor tests. Data read and results of calculations are also checked for correct or consistent values. Versions compiled to run on 32-Bit and 64-Bit processors are provided.
Three new programs provided are BurnInSSE, IntBurn and DriveStress but they can also be used in conjunction with programs produced earlier. BurnInSSE64 and BurnInSSE32 were compiled to use the same range of SSE floating point instructions, where GCC generates fast execution speeds. The IntBurn tests are based on assembly code, with IntBurn32 using 32 bit integers and IntBurn64 accessing a larger number of 64 bit registers.
DriveStress32 and DriveStress64 were compiled from the same C code and measure drive and bus speeds (e.g. SATA or USB) whilst checking data read for correct values.
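The pattern checking idea can be sketched in C as follows. IntBurn itself is assembly based, so this is only an outline of the write, read and compare cycle, using the six patterns that appear in the IntBurn log; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The six 64 bit patterns shown in the IntBurn results log */
static const uint64_t patterns[6] = {
    0x0000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL, 0xA5A5A5A5A5A5A5A5ULL,
    0x5555555555555555ULL, 0x3333333333333333ULL, 0xF0F0F0F0F0F0F0F0ULL
};

/* Write one pattern to every word, read it back and count mismatches.
   volatile stops the compiler optimising the read back away. */
size_t write_read_check(volatile uint64_t *buf, size_t words, uint64_t pattern)
{
    size_t errors = 0;
    assert(buf != NULL);
    for (size_t i = 0; i < words; i++)
        buf[i] = pattern;
    for (size_t i = 0; i < words; i++)
        if (buf[i] != pattern)
            errors++;
    return errors;
}

/* Run all six patterns over the buffer, returning total mismatches */
size_t run_all_patterns(uint64_t *buf, size_t words)
{
    size_t errors = 0;
    for (int p = 0; p < 6; p++)
        errors += write_read_check(buf, words, patterns[p]);
    return errors;
}
```

A 4 KB test, as in the example log, would use a buffer of 512 words. A burn-in run would repeat the passes for the requested running time and log the error count.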
Earlier programs that also have reliability testing options, and are included in the package, are
Livermore Loops and nVidia CUDA Benchmarks.
Successes - Three significant problems were identified during testing. The first was apparently excessive temperatures on a desktop PC, compared with earlier measurements via Windows. This was cured by clearing dust out of the CPU heatsink using a compressed air sprayer. Then there were two Linux peculiarities that seem to be affected by power saving options. A desktop PC with a Core 2 Duo CPU showed a throughput increase of three times using both cores. Here, using one core with “On-Demand” CPU GHz (via Frequency Scaling Monitor), the processor was running at 1.6 GHz instead of 2.4 GHz. Then a laptop, again with a Core 2 Duo CPU, overheated, causing the CPU to run at less than half speed. Unlike with Windows, on powering up into Ubuntu, initial CPU temperatures were high, with the fan not appearing to run as fast as it might. On an apparently random basis, the laptop sometimes started at a lower temperature and did not overheat, with the fan apparently running at high speed.
Paging/Swapping Tests - Running multiple copies of the processor exercise programs, with appropriate parameters to demand more main memory capacity than is available, will lead to data being swapped out/in to/from disk. However, with excessive demands, running times can be unpredictable.
Multitasking Scripts - Examples are provided showing how to mix and match programs and run time parameters to soak test complete systems for as long as is required. They also demonstrate how to organise dynamically displayed results in multiple X terminal windows.
The test programs display and log results of calculations and speeds at regular intervals. Examples are shown below, with interpretation and more details in
linux burn-in apps.htm.
The htm report includes results on the Core i7, showing variances caused by Hyperthreading. The tests comprised six copies of BurnInSSE and the most demanding CUDA Shared Memory test, over 10 minutes. Temperatures were measured using Psensor and CPU results shown are averages over readings for four cores. CPU GFLOPS are total from the six different streams. The CUDA program uses more than 100% of one core and the CPU produces more GFLOPS than 4 times that from one core, due to hyperthreading effects. Maximum temperatures are not excessive.
IntBurn
Test 4 KB at 10x2 seconds per test, Start at Thu Mar 17 12:00:59 2011
Write/Read
1 10529 MB/sec Pattern 0000000000000000 Result OK 25705389 passes
2 10579 MB/sec Pattern FFFFFFFFFFFFFFFF Result OK 25826660 passes
3 10592 MB/sec Pattern A5A5A5A5A5A5A5A5 Result OK 25858754 passes
4 10587 MB/sec Pattern 5555555555555555 Result OK 25846727 passes
5 10601 MB/sec Pattern 3333333333333333 Result OK 25880968 passes
6 10602 MB/sec Pattern F0F0F0F0F0F0F0F0 Result OK 25883259 passes
Max 2236 64 bit MIPS
Read
1 16941 MB/sec Pattern 0000000000000000 Result OK 82719400 passes
2 16946 MB/sec Pattern FFFFFFFFFFFFFFFF Result OK 82744300 passes
3 16932 MB/sec Pattern A5A5A5A5A5A5A5A5 Result OK 82676600 passes
4 16927 MB/sec Pattern 5555555555555555 Result OK 82653700 passes
5 16883 MB/sec Pattern 3333333333333333 Result OK 82439400 passes
6 16857 MB/sec Pattern F0F0F0F0F0F0F0F0 Result OK 82311300 passes
Max 2515 64 bit MIPS
BurnInSSE
Using 400 KBytes, 32 Operations Per Word, For Approximately 1 Minutes
Pass 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
1 100000 32 67500 15.10 14304 0.356166393 Yes
2 100000 32 67500 15.11 14296 0.356166393 Yes
3 100000 32 67500 15.09 14312 0.356166393 Yes
4 100000 32 67500 15.33 14091 0.356166393 Yes
DriveStress
File size 10.25 MB x 4 files, minimum reading time 1 minutes
File 1 10.25 MB written in 0.12 seconds
File 2 10.25 MB written in 0.14 seconds
File 3 10.25 MB written in 0.11 seconds
File 4 10.25 MB written in 0.14 seconds
Start Reading Sun Apr 17 20:06:07 2011
Read passes 18 x 4 Files x 10.25 MB in 0.25 minutes
Read passes 36 x 4 Files x 10.25 MB in 0.51 minutes
Read passes 54 x 4 Files x 10.25 MB in 0.76 minutes
Read passes 72 x 4 Files x 10.25 MB in 1.01 minutes
Start Repeat Read Sun Apr 17 20:08:08 2011
Passes in 1 second(s) for each of 164 blocks of 64KB:
1440 1480 1480 1480 1480 1400 1480 1480 1480 1460 1380
1480 1480 1460 1480 1440 1440 1480 1480 1480 1440 1460
1480 1440 1480 1460 1500 1460 1480 1760 1540 1480 1480
1440 1480 1480 1480 1480 1460 1440 1480 1480 1480 1460
+ another 120 results
No errors found during reading tests
############################################################################
Core i7 3.7 GHz, GeForce GTX 650
       Stand                                                    Max
       Alone ------------------ GFLOPS ------------------  Over 15s
4 CPU     90   86   99   83   96   86   86   88   97   99  109  116
GPU      430  430  430  430  430  430  430  430  430  430  430  430
Minute     0    1    2    3    4    5    6    7    8    9   10
                                                               Rise
             ------------------- °C -----------------------
CPUs      32   55   58   60   61   62   62   63   63   63   63   31
GPU       30   46   53   56   58   59   59   60   60   60   60   30
Multithreading Benchmarks
These multithreading tests are based on the above benchmarks, in turn,
Maximum CPU Speeds,
Whetstone Classic Benchmark,
Original OpenMP Benchmark,
MemSpeed Benchmark,
BusSpeed Benchmark and
RandMem Benchmark.
For further details, sample results, benchmark programs, source code and instructions see
linux multithreading benchmarks.htm and
linux_multithreading_apps.tar.gz.
See also results on a Core i7 with 4 cores plus 4 Hyperthreading
Six benchmarks are provided that can run using up to 64 concurrent threads, with versions compiled to run using 64 bit or 32 bit systems. Performance is mainly measured as Millions of Instructions Per Second (MIPS), Millions of Floating Point Operations Per Second (MFLOPS) or Millions of Bytes per Second (MB/S).
Simple Add Tests - execute 32 bit or 64 bit integer instructions and 128 bit SSE floating point functions via assembly language. These use simple add operations with little access to external data. Resultant performance is generally proportional to the number of CPU cores with some gains also identified when Hyperthreading is available. Each thread executes independent code.
Whetstone Benchmark - is the first general purpose benchmark that set industry standards of computer system performance, mainly dependent on floating point speed but with some independently timed integer test functions. Data used is generally contained in L1 cache with performance gains again proportional to the number of cores. Each thread again executes independent code.
MP MFLOPS Program - uses the same functions as my CUDA and OpenMP benchmarks, comprising routines with 2, 8 and 32 add or multiply floating point calculations with data from higher level caches or RAM. The 64 bit version compiles using SSE floating point, where up to 6 MFLOPS per CPU MHz per core can be produced. The 32 bit program uses the much slower original 80387 FPU instructions. These programs can also be used as burn-in/reliability tests. Each thread executes the same functions but on a different segment of the data.
MP Memory Speed Tests - employ three sequences of operations, using double and single precision floating point numbers and integers, on data sized between 4 KB and 25% of RAM size. The operations are memory to memory transfers with 0, 1 and 2 arithmetic calculations. The 64 bit version again uses SSE functions but not as efficiently as MP MFLOPS. Again each thread has the same procedures using different segments of the data.
Calculations are the same as MemSpeed Benchmark, used with OpenMP, where there is no programmable control on the order in which data is accessed.
MP Memory Bus Speed Tests - read data at a range of sizes covering caches and RAM. Data is accessed with varying address increments to identify reading data in bursts over the bus and allow estimation of maximum bus/memory speed. This time, each thread reads all the data. The 64 bit version uses the double size 8 byte words, where data transfer speed can be twice that of the 32 bit compilation, demonstrating that 32 and 64 bit integer instructions can execute at the same speed.
MP Memory Random Access Speed Benchmark - comprises serial and random access read and read/write tests that cover cache and RAM data sizes. All threads access the same data but starting at different points. In this case, data could be corrupted with concurrent updates, but the Operating System appears to flush caches to avoid this, producing extremely slow performance. Extra tests (Mutex) avoid this conflict by executing one read/write test at a time, leading to some slower and some faster speeds. Random access can be affected by burst reading/writing with associated poor performance.
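The arrangement used by MP MFLOPS and MP Memory Speed, where every thread runs the same function on its own segment of the data, can be sketched with POSIX threads. This is a minimal outline, not the benchmark code: the names, the two operation calculation and the way the remainder goes to the last thread are assumptions for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64   /* the benchmarks are described as allowing up to 64 */

typedef struct {
    float *data;         /* start of this thread's segment */
    size_t words;        /* number of words in the segment */
} segment_t;

/* Thread body: the same function for every thread, different segment */
static void *update_segment(void *arg)
{
    segment_t *seg = (segment_t *)arg;
    for (size_t i = 0; i < seg->words; i++)
        seg->data[i] = (seg->data[i] + 1.0f) * 2.0f;  /* 2 operations per word */
    return NULL;
}

/* Split x[0..n) into nthreads segments and process them concurrently.
   Returns 0 on success, -1 if a thread could not be created. */
int run_segments(float *x, size_t n, int nthreads)
{
    pthread_t tid[MAX_THREADS];
    segment_t seg[MAX_THREADS];
    int started = 0, rc = 0;

    assert(x != NULL && nthreads >= 1 && nthreads <= MAX_THREADS);
    size_t chunk = n / (size_t)nthreads;

    for (int t = 0; t < nthreads; t++) {
        seg[t].data = x + (size_t)t * chunk;
        /* the last thread also takes any remainder */
        seg[t].words = (t == nthreads - 1) ? n - (size_t)t * chunk : chunk;
        if (pthread_create(&tid[t], NULL, update_segment, &seg[t]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);
    return rc;
}
```

Because the segments do not overlap, no locking is needed. The Random Access tests differ in that all threads touch the same data, which is where the Mutex variants come in. On older systems this may need compiling with gcc -pthread.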
Examples of results log format on a quad core 3.0 GHz Phenom II are given below.
Simple Add Tests
Multithreading Add Test 64 bit Version 1.0 Thu May 5 11:35:18 2011
Integer Additions 4 Threads
Thread 4 - 8281 64 bit Integer MIPS
Thread 2 - 7996 64 bit Integer MIPS
Thread 1 - 7815 64 bit Integer MIPS
Thread 3 - 7800 64 bit Integer MIPS
Total - 31892 64 Bit Integer MIPS
Aggregate - 31201 64 Bit Integer MIPS, based on last to finish
SSE Floating Point Additions 4 Threads
Thread 2 - 12030 32 Bit SSE MFLOPS
Thread 3 - 11976 32 Bit SSE MFLOPS
Thread 4 - 11861 32 Bit SSE MFLOPS
Thread 1 - 11692 32 Bit SSE MFLOPS
Total - 47559 32 Bit SSE MFLOPS
Aggregate - 46770 32 Bit SSE MFLOPS, based on last to finish
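The Total and Aggregate lines above can be related by a short calculation. The sketch assumes, and this is an inference from the log rather than a documented formula, that every thread executes the same amount of work, so throughput based on the last thread to finish is the thread count multiplied by the slowest thread's speed.

```c
#include <assert.h>

/* Sum of per-thread speeds (the Total line) */
double total_speed(const double *speed, int n)
{
    double sum = 0.0;
    assert(speed != 0 && n > 0);
    for (int i = 0; i < n; i++)
        sum += speed[i];
    return sum;
}

/* Throughput based on the last thread to finish, assuming equal work
   per thread: n x the slowest thread's speed */
double aggregate_speed(const double *speed, int n)
{
    assert(speed != 0 && n > 0);
    double slowest = speed[0];
    for (int i = 1; i < n; i++)
        if (speed[i] < slowest)
            slowest = speed[i];
    return slowest * n;
}
```

For the integer results, speeds of 8281, 7996, 7815 and 7800 MIPS give a total of 31892 and an aggregate of 31200, against the logged 31892 and 31201. The small difference is presumably rounding of the displayed per-thread figures.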
Whetstone MP Benchmark
Multithreading Single Precision Whetstones 64-Bit Version 1.0
Using 4 threads - Sat May 14 12:03:51 2011
        MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
Thread             1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS
     1   2861    927    872    747     71     38   2947   2259    629
     2   2865    875    892    745     71     38   3294   2198    641
     3   2875    869    892    744     71     38   3408   2202    645
     4   2896    906    895    744     72     38   3141   2232    651
 Total  11496   3577   3550   2979    285    151  12790   8891   2566
 MWIPS  11389 Based on time for last thread to finish
MP MFLOPS Benchmark
64 Bit MP SSE MFLOPS Benchmark 1, 4 Threads, Tue May 17 19:00:43 2011
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 102400 2 10000 0.091754 22321 0.764063 Yes
Data in & out 1024000 2 1000 0.136134 15044 0.970753 Yes
Data in & out 10240000 2 100 0.632075 3240 0.997008 Yes
Data in & out 102400 8 10000 0.167023 49047 0.850923 Yes
Data in & out 1024000 8 1000 0.176219 46488 0.982342 Yes
Data in & out 10240000 8 100 0.658828 12434 0.998200 Yes
Data in & out 102400 32 10000 0.558509 58670 0.660143 Yes
Data in & out 1024000 32 1000 0.556450 58888 0.953631 Yes
Data in & out 10240000 32 100 0.722131 45377 0.995203 Yes
MP Memory Speed
MP Memory Reading Speed Test 64 Bit Version 1 Using 4 Threads
Start of test Tue Jun 7 11:32:54 2011
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int64 Dble Sngl Int64 Dble Sngl Int64
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 15704 11347 10961 17813 12518 15904 13744 8714 8758
8 24188 15367 14929 26770 17870 21025 20789 10866 10234
16 33319 19229 18266 38724 23589 23124 31390 13114 13157
32 40697 20675 21180 51120 27260 25282 39385 13921 13960
65 45013 22913 22267 57143 30132 24875 42247 14314 14241
131 45569 23573 22953 61979 31356 27585 44688 14427 13289
262 48701 23759 22666 63235 32103 27892 44447 14200 14453
524 44900 22996 20417 53167 30753 25832 36085 14671 13403
1048 44929 23357 20300 54596 30302 25790 36207 14708 13590
2097 42017 22864 20927 42429 28809 24778 26734 13125 12659
4194 34909 20379 19542 36402 25268 21093 18592 12625 12821
8388 22498 17592 17006 23354 19577 18854 12489 9400 9657
16777 8906 8697 8781 8884 8841 8844 4433 4217 4440
33554 8848 8684 8606 8877 8436 8843 4412 4293 4422
67108 8423 8445 8433 8685 8506 8526 4228 4296 4273
134217 8704 8453 8572 8563 8426 8485 4383 4303 4346
268435 8623 8579 8539 8731 8652 8612 4408 4301 4322
536870 8683 8331 8534 8724 8658 8444 4371 4330 4325
MP Memory Bus Speed
MP Bus Speeds 32 bit Version 1.0, 4 Threads, Fri Jun 17 16:44:21 2011
Kbytes Inc32wds Inc16wds Inc8wds Inc4wds Inc2wds ReadAll 128bSSE2
6 3901 7614 14703 28644 29313 34882 74424
24 7466 14648 28660 29468 37750 40926 79860
96 4648 5085 8422 19230 33948 39486 74050
384 4774 5131 9864 19142 32406 41067 82021
768 2726 2746 5361 9874 17152 30193 42259
1536 2407 2543 4943 10058 17570 29261 41159
16380 812 837 1684 3635 6772 12743 16252
131070 786 813 1605 3444 6259 12161 14950
393210 807 855 1649 3333 6234 11625 14892
MP Memory Random Access
RandMemMP Speeds 64 Bit Version 1, 4 Threads, Sun Jun 26 18:00:21 2011
           ------------------ MBytes Per Second At --------------------
             6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB
Serial RD   29630   53166   44120   44829   29620   29671   12108   11987
Serial RW    5040    7334    7442    7402    7353    7395    8532    6247
Random RD   28388   41211   27807   12265    8866    6611    2103    1271
Random RW     657    1096    1229    1283    1288    1376    1648     993
Mutex SRW    5962    8654    7998    7882    6982    6853    3579    3415
Mutex RRW    6243    8594    5838    2815    1970    1370     486     310
Core i7 Multithreading Benchmarks
This is a quad core/8 thread 3.7 GHz Core i7 4820K with 10 MB L3 cache, normally running at Turbo Boost speed of 3.9 GHz. It has 4 memory channels with maximum speed of 800 MHz (bus speed) x 2 (DDR) x 4 (channels) x 8 (bus width in bytes) or 51.2 GB/second.
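The memory speed calculation above can be written as a small C function, using the figures from the text. The function name is hypothetical and the bus width is taken to be in bytes.

```c
#include <assert.h>

/* Peak RAM transfer rate in GB/second:
   bus MHz x transfers per clock x channels x bus width in bytes / 1000 */
double peak_ram_gb_per_sec(double bus_mhz, int transfers_per_clock,
                           int channels, int bus_bytes)
{
    assert(bus_mhz > 0.0 && transfers_per_clock > 0);
    assert(channels > 0 && bus_bytes > 0);
    return bus_mhz * transfers_per_clock * channels * bus_bytes / 1000.0;
}
```

Calling peak_ram_gb_per_sec(800.0, 2, 4, 8) gives 51.2, the specification figure that the MP BusSpeed results are compared with below.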
Simple Add Tests - See also Maximum CPU Speed Tests, where the stand alone speeds are slightly faster than those for single threads. It also seems that, for these particular code sequences, eight threads are required for near a four times performance improvement, where throughput is 12.2 MIPS/MHz and 15.8 MFLOPS/MHz.
Whetstone MP Benchmark - The single core version of this benchmark does not use pipelines very efficiently but, using 8 threads, performance of MFLOPS test is increased by 7.8 times, but 4 to 5 times on integer routines.
MP MFLOPS Benchmark - This uses the same basic C code as the OpenMP variety. See comparisons above. Note that there is a second version, compiled to use AVX instructions. Maximum speed of one core, with linked multiply and add, is 31.2 GFLOPS using SSE instructions, and twice that with AVX. With 4 cores, SSE and AVX maximum GFLOPS are 124.8 and 249.6, with 75% and 71% of these being demonstrated.
MP BusSpeed - This did not benefit from running via 8 threads, compared with four. Measured maximum RAM speed was greater than the 51.2 GB/second specification. This was due to all threads reading the same data, which could be supplied from the 10 MB shared cache. A new version was produced to minimise the effect, with threads starting reading from different addresses, still in the same data array, reducing maximum RAM speed to 40 GB/second or less.
MP MemSpeed - This firstly shows single and double precision multiply + add tests, using one and eight threads, with normal 64 bit compilation and, again, with AVX options, then with one thread for a 32 bit Operating System.
There are some start up overheads, providing slower performance than MemSpeed Benchmark above, using one thread, but, as each thread handles a unique segment of data, cache flushing is minimised with multiple threads.
The benchmarks’ assembly code listings show that full SIMD SSE and AVX instructions are used but, possibly because of compiling for multiple threads, there are excessive numbers of addition instructions generated. This leads to some speeds being slower than OpenMP MemSpeed, and to SSE/SSE2 being faster than AVX.
The additional results, for the second tests with just addition, show that the compiled code is much better, with SSE/SSE2 speeds similar to MemSpeed via OpenMP and AVX instructions providing appropriate performance gains. However, none of these GFLOPS speeds are close to the maximum potential of 31.2 single precision GFLOPS with SSE, and double that using AVX instructions (half these with double precision).
MP Random Access Benchmark - As expected, multithreading performance can be worse than using a single thread, when write back to memory is used, but reasonable performance and improvements were possible with data in the large L3 cache. Using Mutex restrictions led to no real gains from multithreading.
Simple Add Tests
Multithreading Add Test 64 bit Version 1.0 Sat Nov 8 12:16:25 2014
Integer Additions 8 Threads
Thread 3 - 6318 64 bit Integer MIPS
Thread 5 - 6307 64 bit Integer MIPS
Thread 2 - 6241 64 bit Integer MIPS
Thread 6 - 6212 64 bit Integer MIPS
Thread 7 - 6124 64 bit Integer MIPS
Thread 4 - 6036 64 bit Integer MIPS
Thread 8 - 6001 64 bit Integer MIPS
Thread 1 - 5923 64 bit Integer MIPS
Total - 49162 64 Bit Integer MIPS
Aggregate - 47387 64 Bit Integer MIPS, based on last to finish
SSE Floating Point Additions 8 Threads
Thread 7 - 7767 32 Bit SSE MFLOPS
Thread 8 - 7765 32 Bit SSE MFLOPS
Thread 3 - 7752 32 Bit SSE MFLOPS
Thread 4 - 7749 32 Bit SSE MFLOPS
Thread 5 - 7738 32 Bit SSE MFLOPS
Thread 2 - 7727 32 Bit SSE MFLOPS
Thread 1 - 7725 32 Bit SSE MFLOPS
Thread 6 - 7693 32 Bit SSE MFLOPS
Total - 61916 32 Bit SSE MFLOPS
Aggregate - 61540 32 Bit SSE MFLOPS, based on last to finish
Single Thread 11937 64 Bit Integer MIPS
15450 32 Bit SSE MFLOPS
Two Threads 23069 64 Bit Integer MIPS
30887 32 Bit SSE MFLOPS
Four Threads 24717 64 Bit Integer MIPS
24167 64 Bit Integer MIPS, based on last to finish
46409 32 Bit SSE MFLOPS
30903 32 Bit SSE MFLOPS, based on last to finish
Whetstone MP Benchmark
Multithreading Double Precision Whetstones 64-Bit Version 1.0
Using 8 threads - Sat Nov 8 14:58:12 2014
          MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
Thread               1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS
1          3828   1321   1320    959     92     62   3156   2963    629
2          3803   1270   1321    952     92     61   3155   2930    628
3          3811   1315   1282    956     92     61   3125   2990    630
4          3807   1259   1280    952     92     62   3145   2958    629
5          3821   1286   1287    961     92     62   3087   2926    629
6          3815   1283   1284    962     91     62   3134   2933    629
7          3818   1300   1306    956     92     62   3135   2929    629
8          3821   1286   1304    958     92     62   3143   2931    629
Total     30524  10321  10384   7657    733    494  25079  23559   5033
Total
1 Thrd     4648   1331   1331    977    122     70   4720   5855    983
2 Thrd     9274   2661   2660   1945    243    140   9769  11717   1964
4 Thrd    18078   5263   5229   3907    488    265  15620  17408   3929
MP MFLOPS Benchmark
MFLOPS 1 to 8 Threads
4 Byte Ops/ Repeat SSE ------ SSE ------ ------ AVX ------
Words Word Passes 1 CPU 1 4 8 1 4 8
100000 2 2500 9918 9681 45340 54621 12542 62273 60258
1000000 2 250 9688 9759 21688 41832 11404 23031 44329
10000000 2 25 5870 5990 9237 10026 5991 8970 9977
100000 8 2500 24448 24533 49320 92086 35982 159040 173224
1000000 8 250 24465 24570 49918 92352 36180 80096 151909
10000000 8 25 20055 19975 36638 39982 23299 40124 40153
100000 32 2500 23251 23269 46942 92408 46400 90572 173372
1000000 32 250 23265 23307 89676 93282 46572 91058 177831
10000000 32 25 23063 23052 91029 92050 44729 88877 158594
MP Memory Speed
x[m]=x[m]+s*y[m]
64b 1 Thread 64b 8 Thread 64b AVX 1 T 64b AVX 8 T 32b 1 Thread
KBytes Dble Sngl Dble Sngl Dble Sngl Dble Sngl Dble Sngl
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 29668 15246 37397 22021 16828 10053 38396 31823 22323 11275 L1
8 30422 15420 52063 33134 16865 10130 46928 32871 22744 11317
16 30754 15503 69122 44818 16891 10136 53801 37870 22887 11340
32 30680 15459 98246 51419 16102 10134 66372 37707 22870 11324
64 28867 15292 103196 54739 16872 10132 68113 39620 22352 11281 L2
128 28955 15286 115996 53402 16895 10132 61264 36423 22359 11296
256 28741 15287 113644 60777 16785 10134 68244 40618 22165 11296
512 24664 15200 116243 60628 16580 10128 65631 37589 21408 11285 L3
1024 24662 15207 117177 57777 16620 10087 63796 37746 21288 11270
2048 24424 15207 95433 58470 16444 9827 64988 40739 21305 11268
4096 24408 14253 98608 57900 15592 9839 63209 36650 20958 11141
8192 24213 14940 99671 56541 15666 8823 67851 38623 20297 11030
16384 14983 11747 28689 28004 12310 9117 30911 28600 15179 10297 RAM
32768 14667 11464 25857 25885 12253 9098 24926 24294 15075 9576
65536 14523 11772 24875 24963 11968 9016 24070 22805 14547 9738
131072 14433 11570 24789 24833 12564 9180 23856 25190 15249 10246
262144 14266 11165 25525 24575 12529 8851 25236 22608 15273 10252
524288 14386 11824 25054 24707 12338 8931 24974 24490 15295 10268
1048576 14452 11468 25402 25735 11954 8972 24917 24153 15308 10278
2097152 14908 11769 25100 25402 12396 8901 24545 25061
4194304 14938 11916 24785 24556 12284 9007 24608 25285
Max GFLOPS 3.8 3.9 14.6 15.2 2.1 2.5 8.5 10.2 2.9 2.8
x[m]=x[m]+y[m]
64b 1 Thread 64b 8 Thread 64b AVX 1 T 64b AVX 8 T 32b 1 Thread
KBytes Dble Sngl Dble Sngl Dble Sngl Dble Sngl Dble Sngl
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
16 41065 20688 82924 46075 61385 61280 116710 90819 27816 14030 L1
128 34323 20476 140036 76202 48299 47771 226979 230972 26712 13977 L2
8192 26045 19106 108046 80815 28005 27977 121758 113292 22607 13535 L3
131072 15644 14115 25675 25639 14893 14915 24319 25609 15862 12099 RAM
Max GFLOPS 2.6 2.6 8.8 10.1 3.8 7.7 14.2 28.9 1.7 1.8
MP Memory Bus Speed
MP Bus Speeds 64 bit Version 1.0, 4 Threads, Sun Nov 23 10:35:01 2014
Kbytes Inc32wds Inc16wds Inc8wds Inc4wds Inc2wds ReadAll 128bSSE2
6 76609 51101 75602 140546 104501 167163 205782 L1
24 120982 107268 113828 153185 170288 149892 248761
96 41962 40737 43299 73311 123250 160399 240730 L2
384 19664 20262 20831 38942 75517 128002 160495 L3
768 19242 19941 20676 39821 73897 127177 152781
1536 19103 19854 20683 39137 54701 127196 152980
16380 6210 6913 8363 14942 29204 56919 56522 RAM
131070 5901 6947 8368 15029 29096 51843 61776
393210 5909 5426 8370 12684 29097 58307 59609
1 Thread
6 31501 31266 31243 41117 36617 41277 61526
768 5303 5386 5499 10808 19429 33765 38337
131070 1229 1470 2054 4514 8754 18043 18094
MP Bus Speeds 64 bit Version 2.0, 4 Threads, Sun Nov 23 10:35:44 2014
Same as Version 1.0, except each thread starts at different address
Kbytes Inc32wds Inc16wds Inc8wds Inc4wds Inc2wds ReadAll 128bSSE2
6 28749 29616 58739 64451 61610 129160 231735
24 114043 117435 119746 108799 143160 163902 245756
96 39170 40423 42705 76442 110895 154667 240928
384 19631 20232 20793 40066 69429 126075 158417
768 19212 19923 20648 39748 72952 125329 151560
1536 19086 19296 20661 39791 73469 120135 152311
16380 5843 6857 8210 14523 27776 55150 59064
131070 2038 3108 5197 10201 20004 38092 40726
393210 2090 3101 5072 9867 19538 39489 39824
786420 2083 2943 5082 10133 20016 37592 40764
1572840 2025 3011 5091 10207 19039 39479 40781
1 Thread
6 31501 31266 31243 41117 36617 41277 61526
768 5303 5386 5499 10808 19429 33765 38337
131070 1226 1484 2096 4411 8462 18188 18382
MP Memory Random Access
RandMemMP Speeds 64 Bit Version 1, 8 Threads, Sat Nov 8 12:41:51 2014
------------------ MBytes Per Second At --------------------
6 KB 24 KB 96 KB 384 KB 768 KB 1536 KB 12 MB 96 MB
Serial RD 37112 77469 94806 94862 90795 86826 65882 56315
Serial RW 8924 29533 54380 47712 51176 69146 68008 22145
Random RD 36944 76814 62245 33838 24552 21588 13472 3341
Random RW 2000 6016 9058 17412 16237 16733 10066 2806
Mutex SRW 7829 16705 19723 16432 16331 16570 11550 10669
Mutex RRW 10672 20797 8933 5659 4844 4561 2659 940
RandMemMP Speeds 64 Bit Version 1, 1 Threads, Sat Nov 8 12:39:21 2014
Serial RD 28021 27808 20268 19318 19231 19255 12455 11589
Serial RW 29972 30232 21894 17867 17410 17420 12242 11581
Random RD 27479 27463 13595 8251 6228 5605 2470 1011
Random RW 30429 30076 9224 6120 5177 4782 2800 982
Mutex SRW 29987 30245 21895 17875 17419 17249 12373 11495
Mutex RRW 30417 30027 9199 6117 5175 4780 2796 982
Image Processing Benchmarks
SDL_bmpspd32 and SDL_bmpspd64 benchmarks execute the same tests as the Windows version, where details and results can be found in
bmpspeed results.htm.
They are 32 bit and 64 bit varieties compiled to run under Linux using Simple DirectMedia Layer (SDL) functions. The benchmarks generate BMP files and measure speed of saving, loading, scrolling, rotating and editing of 0.5, 1, 2, 4 etc. to 512 MB images.
The programs automatically adjust maximum image size used, depending on available main memory, but run time parameters can be used to change this.
The execution files, source code, compilation and running instructions can be found in
linux_image_processing_benchmarks.tar.gz
with further details in
linux image processing benchmarks.htm. Example results are below.
Besides the standard Configuration Details shown earlier, additional system attributes are obtained for this benchmark and included in the following example results.
Hardware benchmarked for
the main report
were desktops, a laptop and a netbook using internal and external (eSATA) disk drives plus USB flash memory and disk drives. Linux versions used were 32-Bit and 64-Bit Ubuntu 10.10 with GNOME 2, 64-Bit Ubuntu 11.04 with Unity on two different graphics arrangements, 64-Bit Fedora 14 with GNOME 2 and 64-Bit OpenSuse 11.4 with KDE.
Results are also provided for the Core i7, with a USB 3.0 disk drive, plus faster CPUs, memory and graphics card.
Additional System Details
#####################################################################
2.4 GHz Core 2 Duo, eSATA disk
Memory stats from /proc/meminfo
MemTotal: 3963.8 MB A
MemFree: 3181.8 MB B
Buffers: 46.5 MB C
Cached: 297.5 MB D
Memory Used: 438.0 MB = A - B - C - D
Current Directory Path (getcwd) and drive space (statvfs):
/home/roy/all64/bmpspd
Total MB 11263, Free MB 9446, Used MB 1817
See files hd1.txt and hd2.txt for details of drive used
SDL_GetVideoInfo
hw_available flag is 0 - cannot create hardware surfaces
Display size 1280 x 1024 pixels at 32 bits
SDL_VideoDriverName = x11
Graphics (command - lspci | grep -i vga > vga.txt)
VGA compatible controller: nVidia Corporation G84 [GeForce 8600 GT] (rev a1)
#####################################################################
Image Editing Speeds 64 Bit Version 1, Sat Aug 6 09:45:47 2011
Input Enlarge Save Load Scroll Scroll Rotate Max MB
Image Display Display Repeat Overall 90 deg Memory
Mbytes Secs Secs Secs msecs MB/Sec Secs Used
0.5 0.02 0.01 0.01 0.83 601.15 0.01 440.2
1.0 0.02 0.05 0.02 1.63 612.30 0.02 441.9
2.0 0.02 0.02 0.03 3.31 634.52 0.02 445.4
4.0 0.03 0.04 0.06 5.66 625.44 0.03 451.6
8.0 0.05 0.08 0.11 6.73 584.70 0.05 464.7
16.0 0.09 0.16 0.20 6.77 580.53 0.08 489.5
32.0 0.16 0.29 0.31 6.70 587.05 0.16 541.1
64.0 0.29 0.59 0.71 6.94 566.85 0.32 672.4
128.0 0.59 1.32 1.22 6.64 592.54 0.65 785.3
256.0 1.14 2.35 2.60 6.63 593.46 3.51 1129.9
512.0 2.27 4.90 4.73 6.65 591.47 3.91 1822.9
#####################################################################
3.7 GHz Core i7 (3.9 GHz Turbo Boost), USB 3 disk
Memory stats from /proc/meminfo
MemTotal: 32114.1 MB A
MemFree: 30952.5 MB B
Buffers: 40.2 MB C
Cached: 376.1 MB D
Memory Used: 745.4 MB = A - B - C - D
Current Directory Path (getcwd) and drive space (statvfs):
/home/roy/benchmarks/Old/bmpspd/bin64
Total MB 446040, Free MB 435462, Used MB 10577
See files hd1.txt and hd2.txt for details of drive used
SDL_GetVideoInfo
hw_available flag is 0 - cannot create hardware surfaces
Display size 1920 x 1080 pixels at 32 bits
SDL_VideoDriverName = x11
Graphics (command - lspci | grep -i vga > vga.txt)
VGA compatible controller: NVIDIA Corporation GK107 [GeForce GTX 650] (rev a1)
#####################################################################
Image Editing Speeds 64 Bit Version 1, Sat Dec 27 09:58:41 2014
Input Enlarge Save Load Scroll Scroll Rotate Max MB
Image Display Display Repeat Overall 90 deg Memory
Mbytes Secs Secs Secs msecs MB/Sec Secs Used
0.5 0.01 0.01 0.02 0.65 774.44 0.00 751.4
1.0 0.01 0.11 0.01 1.04 957.30 0.01 752.2
2.0 0.02 0.01 0.03 1.87 1121.19 0.01 756.3
4.0 0.02 0.03 0.03 3.37 1108.22 0.02 763.0
8.0 0.03 0.05 0.15 4.72 1119.93 0.02 774.7
16.0 0.05 0.09 0.26 5.61 1108.62 0.04 800.4
32.0 0.06 0.31 0.51 5.02 1239.99 0.05 853.0
64.0 0.11 0.56 0.62 5.52 1126.91 0.12 983.3
128.0 0.20 1.32 1.28 5.87 1059.86 0.23 1095.7
256.0 0.38 2.78 2.67 5.86 1062.25 0.58 1443.1
512.0 0.74 5.42 5.07 6.35 979.01 0.83 2135.9
OpenGL Benchmark
The benchmarks, videogl32 and videogl64, are 32-Bit and 64-Bit Linux compilations of OpenGL code used for testing via Windows. Details and results can be found in
Linux OpenGL Benchmarks.htm.
The benchmarks measure graphics speed in terms of Frames Per Second (FPS) via six simple and more complex tests. The first four tests portray moving up and down a tunnel including various independently moving objects, with and without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe format, drawn with 23,000 straight lines. The second has colours and textures applied to the surfaces.
The textures are obtained from 24 bit BMP files that can be up to 256 x 256 pixels at 192 KB. The BMP files and Linux execution files can be found in
linux_opengl_benchmarks.tar.gz,
along with source code, compilation and running instructions. Windows benchmarks from the same source code are also included.
The benchmarks were run on a variety of Ubuntu, Fedora and OpenSuse distros and different PC hardware, with nVidia, ATI and Intel graphics. Newly installed Linux systems do not [so far] provide OpenGL hardware acceleration and, except for nVidia, finding such a driver that works with a particular release is seemingly impossible, in some cases.
As a default, the benchmark runs using a full screen window, but input parameters allow different sized windows to be used, via Terminal commands or a script file. Following are example log files from tests using a Core 2 Duo CPU and GeForce 8600 GT graphics, using a default driver and one from nVidia.
Decreasing performance, as the window size increases, suggests a graphics speed limitation, with constant performance indicating that processor speed is the limiting factor.
2014 results for the Core i7 system are also provided below, where speeds can all be twice those on the Core 2 Duo.
#####################################################################
Linux OpenGL Benchmark 64 Bit Version 1, Wed Oct 26 22:29:24 2011
Running Time Approximately 5 Seconds Each Test
Window Size Coloured Objects Textured Objects WireFrm Texture
Pixels Few All Few All Kitchen Kitchen
Wide High FPS FPS FPS FPS FPS FPS
320 240 221.7 158.1 162.4 109.3 72.1 48.0
640 480 60.9 53.5 46.2 37.6 52.7 22.2
1024 768 23.7 22.0 18.4 15.6 34.9 10.7
1280 1024 15.6 14.6 12.0 10.3 28.5 7.4
End at Wed Oct 26 22:31:38 2011
#####################################################################
Linux OpenGL Benchmark 64 Bit Version 1, Tue Oct 25 18:36:45 2011
Running Time Approximately 5 Seconds Each Test
Window Size Coloured Objects Textured Objects WireFrm Texture
Pixels Few All Few All Kitchen Kitchen
Wide High FPS FPS FPS FPS FPS FPS
320 240 3670.2 2326.6 1160.9 678.8 401.0 229.2
640 480 2463.1 2033.9 896.3 666.3 414.5 231.3
1024 768 1089.2 987.3 541.6 440.9 401.8 214.6
1280 1024 727.0 680.8 412.1 338.3 400.2 194.0
End at Tue Oct 25 18:38:58 2011
#####################################################################
3.7 GHz Core i7, Ubuntu 14.04, GeForce GTX 650
Linux OpenGL Benchmark 64 Bit Version 1, Fri Jan 2 11:16:35 2015
Running Time Approximately 5 Seconds Each Test
Window Size Coloured Objects Textured Objects WireFrm Texture
Pixels Few All Few All Kitchen Kitchen
Wide High FPS FPS FPS FPS FPS FPS
320 240 7488.1 4641.5 2094.8 1249.6 774.4 398.7
640 480 6630.8 5549.2 2217.1 1250.0 744.9 395.3
1024 768 3399.2 3174.1 1958.8 1195.9 655.6 342.4
1280 1024 2151.5 2075.0 1481.5 1158.0 762.3 376.7
1680 1050 1753.3 1692.3 1289.0 1036.6 696.7 361.3
1920 1080 1563.3 1512.1 1189.6 986.0 779.2 375.5
End at Fri Jan 2 11:19:54 2015
On-Line Benchmarks
A Java version of the Whetstone Classic Benchmark, that is executed via a downloaded HTML page, was produced in 1997.
Because of the timing considerations in those days, the benchmark ran for 100 seconds. It also included a measurement of graphics speed. Running this via Firefox and Linux identified some unacceptable text displays and measured speeds, due to over-optimisation. The code was modified slightly to avoid this, running time was reduced and graphics tests were excluded, for a new version, compiled via Java installed under Linux.
The benchmark is run via
WhetJava2.html
or indirectly from
online benchmarks.html,
which also includes tests to measure downloading speed of images (see below).
Performance results are produced in graphics format, but these can be kept using Take Screenshot. A version of the new benchmark was also compiled that runs from a Terminal command, to produce text output to the window and a log file. The format is the same as the graphics display and an example is given below.
Results via Linux and Windows are available in
Whetstone Benchmark Results - Java.
These show differences in 32 bit vs 64 bit, Windows vs Linux, On-line vs Off-line and same results with different browsers. The benchmarks, including source code, can be downloaded from
onlinetests.zip
or
onlinetests.tar.gz.
*************************************************************
3.0 GHz Phenom
Whetstone Benchmark Java Version, Dec 8 2011, 23:38:14
1 Pass
Test Result MFLOPS MOPS millisecs
N1 floating point -1.124750137 894.69 0.0215
N2 floating point -1.131330490 732.82 0.1834
N3 if then else 1.000000000 1027.81 0.1007
N4 fixed point 12.000000000 1735.54 0.1815
N5 sin,cos etc. 0.499110132 41.15 2.0220
N6 floating point 0.999999821 496.69 1.0860
N7 assignments 3.000000000 582.23 0.3174
N8 exp,sqrt etc. 0.825148463 33.54 1.1090
MWIPS 1991.45 5.0215
Operating System Linux, Arch. amd64, Version 2.6.34-12-desktop
Java Vendor Sun Microsystems Inc., Version 1.6.0_26
*************************************************************
3.7 GHz Core i7
Whetstone Benchmark Java Version, Jan 4 2015, 11:53:10
1 Pass
Test Result MFLOPS MOPS millisecs
N1 floating point -1.124750137 1280.00 0.0150
N2 floating point -1.131330490 1150.68 0.1168
N3 if then else 1.000000000 1358.98 0.0762
N4 fixed point 12.000000000 3118.81 0.1010
N5 sin,cos etc. 0.499110132 73.76 1.1280
N6 floating point 0.999999821 658.45 0.8192
N7 assignments 3.000000000 1133.74 0.1630
N8 exp,sqrt etc. 0.935364604 46.60 0.7982
MWIPS 3108.14 3.2174
Operating System Linux, Arch. amd64, Version 3.13.0-24-generic
Java Vendor Oracle Corporation, Version 1.8.0_25
Online Benchmark Downloading Tests measure the downloading time of 1 MByte or 100 KByte BMP, GIF and JPG files and of 200 or 400 70 Byte GIF files. Of particular note, the typical loading time of the 400 GIFs (28 KB in total) is twice as long as that for the 1 MB image files.
JavaDraw Benchmarks
Versions (.class files) compiled with Java JDK 6 and 7 are available to execute off-line, via a terminal command, and on-line, using a browser. There are two benchmarks for each of these, the original (Swing) and, to avoid Windows issues with this, a new version (AWT). For details and results see
JavaDraw.htm.
Java source codes, class files and images used are in:
Java PC Benchmarks.zip.
As shown in the example results below, the benchmark has five test procedures with increasing activity, each one running for 10 seconds, and the first one repeated to identify start up overheads. Note that the benchmark is designed to measure speed and displays might have flashing and missing objects, particularly with the on-line versions. The latter requires specific permissions to execute, with IcedTea JRE appearing to be the only one that enables this ability under Ubuntu (14.04).
Following the example output are a series of results on the 3.7 GHz Core i7 with a GeForce GTX 650 graphics card, running under Ubuntu 14.04, including comparisons with the Java code compiled using JDK 8 and running via JRE 1.8.
It can be seen that compiling with JDK 7 and 8 leads to similar speeds, running via JRE 1.8. Then, running via JRE 1.7 produced a completely different performance profile. The much faster JRE 1.8 performance, with the lighter loading, appears to be associated with a higher level of multithreading but this also applies with the heaviest loading, suggesting different graphics processor utilisation or CPU to GPU communication.
Further results are available using other processors and Windows.
******************************************************************
Java AWT Drawing Benchmark, Jan 5 2015, 10:32:15
Produced by javac 1.8.0_25
Test Frames FPS
Display PNG Bitmap Twice Pass 1 19201 1920.10
Display PNG Bitmap Twice Pass 2 20826 2082.60
Plus 2 SweepGradient Circles 20478 2047.80
Plus 200 Random Small Circles 9620 962.00
Plus 320 Long Lines 3830 383.00
Plus 4000 Random Small Circles 435 43.30
Total Elapsed Time 60.1 seconds
Operating System Linux, Arch. amd64, Version 3.13.0-24-generic
Java Vendor Oracle Corporation, Version 1.8.0_25
******************************************************************
                         On-line  ------ Off-line ------
JDK Compiler                   7       7       7       8
JRE                          1.7     1.7     1.8     1.8
PNG Bitmaps 1                984     779    1971    1920
PNG Bitmaps 2               1006     979    2032    2083
+ SweepGradient Circle       485     453    1923    2048
+ 200 Small Circles          474     403     909     962
+ 320 Long Lines             412     307     312     383
+ 4000 Small Circles         306     219      41      43
Booting Time
Below are booting times on two PCs, from boot menu selection to loaded desktop. The two PCs are a Netbook with a 1.66 GHz Atom CPU, originally running Windows XP, and a desktop PC with a 2.4 GHz Core 2 Duo and Windows Vista. Besides seconds to boot, MB/second reading speed of the drives is provided, derived from the Image Processing Benchmark results.
The first results show Windows booting time, for comparison purposes, the Core 2 Duo being particularly slow. The second and fastest results are for 64-Bit Ubuntu 10.10, booting from the Windows disk in the Netbook, and a fast (for 2009) eSATA disk on the desktop.
Figures for the next six entries are from USB sticks, booting 32-Bit and 64-Bit Ubuntu 10.10, 64-Bit Ubuntu 11.04, 64-Bit Fedora 14 and 64-Bit OpenSuse 11.4.
On moving the drives between systems, it seems that booting time of the next system used can be considerably longer than normal (needs to use alternative drivers?). Also, the first Linux installations were with Ubuntu and nVidia drivers were installed in order to run CUDA based benchmarks, probably the reason why these would only fully boot on using Recovery Mode on the Netbook, with its Intel graphics.
On the desktop, all Linux loading times are faster than Windows, using much slower drives, but the fastest flash drive does not necessarily produce the shortest booting time. Repeating the tests a number of times indicates that booting time depends on differing hardware/distro combinations. The last result is with OpenSuse on a USB disk drive, where the faster data transfer speed, compared to a flash drive, does not improve booting time much.
Later results, loading Ubuntu 14.04, are for the 3.7 GHz Core i7, using a USB 3.0 Seagate Expansion STBX1000101 disk drive and a cheap USB 3.0 Lexar Flash Drive, plus a WD CAVIAR BLACK WD1003FZEX SATA disk for Windows 8.1. All booting times are after a BIOS based menu that takes around 20 seconds to appear after switch on.
Netbook, WinXP, 5400 Desktop, Vista 7200 RPM
RPM Local Disk SATA and eSATA Disks
Drive Linux Boot1 Boot2 Disk Mode Boot1 Boot2 Disk Mode
Secs Secs MB/s Secs Secs MB/s
Windows Disk 64 50 70.0 Norm 170 170 47.8 Norm
Local Disk Ubuntu 10.10 37 35 56.0 Norm 22 23 108.0 Norm
Old Staples Ubuntu 10.10 100 66 9.3 Rec 76 71 8.8 Norm
4 GB Stick 64 Bit 95 71 Rec
PNY Attache Ubuntu 10.10 100 77 18.2 Rec 103 62 20.4 Norm
4 GB Stick 32 Bit
Cruzer U3 Ubuntu 10.10 50 51 16.4 Rec 57 57 16.9 Norm
4 GB Stick 64 Bit
Patriot Rage Ubuntu 11.04 46 57 24.3 Norm 76 48 26.8 Norm
8 GB Stick 64 Bit
Cruzer U3 Fedora 14 110 98 22.0 Norm 73 70 23.8 Norm
16 GB 64 Bit
Cruzer Blade OpenSuse 11.4 82 70 19.1 Norm 70 44 20.8 Norm
8 GB Stick 64 Bit
USB Disk OpenSuse 11.4 59 60 28.4 Norm 48 42 34.8 Norm
64 Bit
Rec = Recovery Mode
################################################################################
Desktop 3.7 GHz Core i7
Drive Linux Boot1 Boot2 Disk Mode
Secs Secs MB/s
Windows 8.1 Disk 57 58 139 Norm
USB 3 Disk Ubuntu 14.04 32 32 112 Norm
USB 3 Flash Ubuntu 14.04 26 26 94 Norm
Roy Longbottom January 2015
The Official Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection